split a large xml file into multiple parts using java [duplicate] - java

This question already has answers here:
XML parsing java.lang.OutOfMemoryError: [memory exhausted]
(2 answers)
Closed last month.
I have an XML file and I want to manipulate its tags using the Java DOM, but the file is 25 GB, so the parse fails with this error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
public Frwiki() {
filePath = "D:\\compressed\\frwiki-latest-pages-articles.xml";
}
public void deletingTag() throws Exception {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document doc = factory.newDocumentBuilder().parse(filePath);
NodeList nodes = doc.getElementsByTagName("*");
for (int j = 0; j < 3; j++) {
for (int i = 0; i < nodes.getLength(); i++) {
Node node = nodes.item(i);
if (!node.getNodeName().equals("id") && !node.getNodeName().equals("title")
&& !node.getNodeName().equals("text") && !node.getNodeName().equals("mediawiki")
&& !node.getNodeName().equals("revision") && !node.getNodeName().equals("page"))
node.getParentNode().removeChild(node);
}
}
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(doc), new StreamResult(filePath));
}

You can split a large file into smaller files using XSLT 3.0 streaming, like this:
<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template name="xsl:initial-template">
<xsl:source-document streamable="yes" href="frwiki-latest-pages-articles.xml">
<xsl:for-each-group ....>
<xsl:result-document href="......">
<part><xsl:copy-of select="current-group()"/></part>
</xsl:result-document>
</xsl:for-each-group>
</xsl:source-document>
</xsl:template>
</xsl:transform>
The "..." parts depend on how you want to split the document and name the result files.
Although XSLT 3.0 streaming is a W3C specification, the only implementation available at the moment is my company's Saxon-EE processor.

Split the large XML file into smaller chunks and process them separately.
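Whether or not the file is split first, the tag removal itself can also be done in a single streaming pass, so the whole document never has to fit in the heap. The following is only a sketch using StAX from the JDK (not the XSLT 3.0 approach above). The class name StreamingTagFilter is made up for illustration, and the set of element names to keep is taken from the question's code.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies XML from a Reader to a Writer, dropping unwanted elements.
final class StreamingTagFilter {
    // Element names to keep (same list as in the question's DOM code).
    static final Set<String> KEEP =
            Set.of("id", "title", "text", "mediawiki", "revision", "page");

    static void filter(Reader in, Writer out) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int skipDepth = 0; // > 0 while inside an element that is being dropped
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (skipDepth > 0 || !KEEP.contains(name)) {
                    skipDepth++; // drop this element and everything nested in it
                    continue;
                }
            } else if (event.isEndElement() && skipDepth > 0) {
                skipDepth--; // closing tag of a dropped element
                continue;
            }
            if (skipDepth == 0) {
                writer.add(event); // copy everything outside dropped elements
            }
        }
        writer.close();
        reader.close();
    }
}
```

For the real dump you would pass a buffered file reader and a writer that points to a different output file, since reading and writing the same file at once would not work here.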


How can i escape special characters with using DOM

This issue has been bugging me a lot lately and I can't seem to find a solution.
I am dealing with a web server that receives an XML document to do some processing. The server's parser has issues with &, ', ", < and >. I know this is bad; I didn't implement the XML parser on that server. But until a patch arrives I need a workaround.
Now, before uploading my XML document to this server, I need to parse it and escape the XML special characters. I am currently using DOM. The issue is that if I iterate through the TEXT_NODEs and replace all the special characters with their escaped versions, then when I save this document,
for d'ex I get d&amp;apos;ex but I need d&apos;ex
It makes sense, since DOM escapes "&". But obviously this is not what I need.
So if DOM is already capable of escaping "&" to "&amp;", how can I make it escape other characters, such as " to &quot;?
If it can't, how can I save the already parsed and escaped text in its nodes without DOM re-escaping it when saving?
This is how I escape the special characters, using the Apache StringEscapeUtils class:
public String xMLTransform() throws Exception
{
String xmlfile = FileUtils.readFileToString(new File(filepath));
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(xmlfile.trim().replaceFirst("^([\\W]+)<", "<"))));
NodeList nodeList = doc.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node currentNode = nodeList.item(i);
if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
Node child = currentNode.getFirstChild();
while(child != null) {
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
//Escaping works here. But when saving the final document, the "&" used in escaping gets escaped as well by DOM.
}
child = child.getNextSibling();
}
}
}
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
FileOutputStream fop = null;
File file;
file = File.createTempFile("escapedXML"+UUID.randomUUID(), ".xml");
fop = new FileOutputStream(file);
String xmlString = writer.toString();
byte[] contentInBytes = xmlString.getBytes();
fop.write(contentInBytes);
fop.flush();
fop.close();
return file.getPath();
}
I think the solution you're looking for is a customized XSLT parser that you can configure for your additional HTML escaping.
I'm not able to say for certain how to configure the xslt file to do what you want, but I am fairly confident it can be done. I've stubbed out the basic Java setup below:
@Test
public void testXSLTTransforms () throws Exception {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element el = doc.createElement("Container");
doc.appendChild(el);
Text e = doc.createTextNode("Character");
el.appendChild(e);
//e.setNodeValue("\'");
//e.setNodeValue("\"");
e.setNodeValue("&");
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(System.out);
//This prints the original document to the command line.
transformer.transform(source, result);
InputStream xsltStream = getClass().getResourceAsStream("/characterswap.xslt");
Source xslt = new StreamSource(xsltStream);
transformer = transformerFactory.newTransformer(xslt);
//This one is the one you'd pipe to a file
transformer.transform(source, result);
}
And I've got a simple XSLT I used for proof of concept that shows the default character encoding you mentioned:
characterswap.xslt
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:text>
Original VALUE : </xsl:text>
<xsl:copy-of select="."/>
<xsl:text>
OUTPUT ESCAPING DISABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="yes"/>
<xsl:text>
OUTPUT ESCAPING ENABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="no"/>
</xsl:template>
</xsl:stylesheet>
And the console out is pretty basic:
<?xml version="1.0" encoding="UTF-8"?>
<Container>&amp;</Container>
Original VALUE : <Container>&amp;</Container>
OUTPUT ESCAPING DISABLED : &
OUTPUT ESCAPING ENABLED : &amp;
You can take the active node from the XSLT execution and perform specific character replacements. There are multiple examples I was able to find, but I'm having difficulty getting them working in my context.
"XSLT string replace" is a good place to start.
This is about the extent of my knowledge with XSLT, I hope it helps you solve your issue.
Best of luck.
I was considering this further, and the solution may not only be XSLT. From your description, I have the impression that rather than xml10 encoding, you're kind of looking for a full set of html encoding.
Along those lines, if we take your current node text transformation:
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
}
And explicitly expect that we want the HTML Encoding:
if (child.getNodeType() == Node.TEXT_NODE) {
//Capture the current node value
String nodeValue = child.getNodeValue();
//Decode for XML10 to remove existing escapes
String decodedNode = StringEscapeUtils.unescapeXml10(nodeValue);
//Then Re-encode for HTML (3/4/5)
String fullyEncodedHTML = StringEscapeUtils.escapeHtml3(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml4(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml5(decodedNode);
//Then place the fully-encoded HTML back to the node
child.setNodeValue(fullyEncodedHTML);
}
I would think that the xml would now be fully encoded with all of the
HTML escapes you were wanting.
Now combine this with the XSLT for output escaping (from above), and the document will not undergo any further transformations when written out to the file.
I like this solution because it limits the logic held in the XSLT file. Rather than managing the entire String find/replace, you would just need to ensure that you copy your entire node and copy the text() with output escaping disabled.
In theory, that seems like it would fulfill my understanding of your objective.
Caveat again is that I'm weak with XSLT, so the example xslt file may
still need some tweaking. This solution reduces that unknown work
quantity, in my opinion.
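To make the "copy every node, but write text() with output escaping disabled" idea concrete, here is a minimal sketch. It is an assumption-laden illustration, not a tested fix for the asker's server: the class name RawTextSerializer and the inline stylesheet are made up for this example, and disable-output-escaping is an optional XSLT 1.0 feature that only takes effect when the result is serialized (for example to a StreamResult), so another processor may ignore it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical helper: serializes a DOM whose text nodes are already escaped.
final class RawTextSerializer {
    // Identity-style stylesheet: elements and attributes are copied as usual,
    // while text nodes are written with output escaping disabled.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='text()'>"
      + "<xsl:value-of select='.' disable-output-escaping='yes'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String serialize(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        // Text is written as-is, so "&apos;" stays "&apos;" instead of "&amp;apos;".
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Because the text is no longer escaped on output, any node that still contains a raw & or < would produce XML that is not well-formed, so every text node has to be escaped before this step.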
I've seen people use regex to do something similar
Copied from (Replace special character with an escape preceded special character in Java)
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
That wacky regex is a "look ahead": a non-capturing assertion that the following character matches something, in this case a character class.
Notice how you don't need to escape characters in a character class, except a ] (even the minus doesn't need escaping if it comes first or last).
The \\\\ is how you code a regex literal \ (escaped once for Java, once for regex).
Here's a test of this working:
public static void main(String[] args) {
String search = "code:xy";
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
System.out.println(newSearch);
}
Output:
code\:xy
This is very closely related to this question (how to Download a XML file from a URL by Escaping Special Characters like < > $amp; etc?).
That post covers a similar case, where the code downloads XML files with parsed/escaped content.
As I understand it, you read the file, parse it and escape the characters, and during saving the XML gets escaped again. While you can use the DOM to check that the XML is well-formed or matches a schema, doing the escaping with file-based operations can handle the XML and HTML special characters. The code sample in the post uses IOUtils and StringUtils to do it. Hope this helps!
I would use StringEscapeUtils.escapeXml10(); details here: https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#ESCAPE_XML10

How to remove element from xml file using text content in java? [duplicate]

This question already has answers here:
remove elements from XML file in java
(5 answers)
Closed 7 years ago.
I have the following in an XML file:
<data>
<group>
<groupname>a</groupname>
<groupuser>Saw</groupuser>
<groupuser>John</groupuser>
</group>
<group>
<groupname>b</groupname>
<groupuser>John</groupuser>
<groupuser>Saw</groupuser>
</group>
<group>
<groupname>c</groupname>
<groupuser>John</groupuser>
<groupuser>Saw</groupuser>
</group>
<user>
<username>John</username>
<password>1234</password>
</user>
</data>
I am trying to remove this element
<groupuser>John</groupuser>
This is my method:
public void removeUserGroup(String username) {
try {
File fXmlFile = new File(filePath);
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
NodeList nList = doc.getElementsByTagName("group");
for (int temp = 0; temp < nList.getLength(); temp++) {
Element group = (Element) nList.item(temp);
for (int temp2 = 0; temp2 < group.getElementsByTagName("groupuser").getLength(); temp2++) {
Element name = (Element) group.getChildNodes().item(temp2);
if (name.getTextContent().equals(username))
group.getParentNode().removeChild(group);
}
}
} catch (Exception ex) {
System.out.println("Database exception");
}
}
The code reaches this line and doesn't throw an exception:
group.getParentNode().removeChild(group);
But nothing happens in the XML file!
I took this method from another question here and modified it to loop over the children of <group>, but it doesn't seem to work.
First, you don't save the XML back to the file.
Second, this code is wrong, although it might work in certain circumstances:
for (int temp2 = 0; temp2 < group.getElementsByTagName("groupuser").getLength(); temp2++){
Save the result to a NodeList instead and loop over the nodes in that list.
And third, you are removing the whole group, not just the groupuser.
After you have removed the element you'll have to save your output to some file to be able to see something:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.transform(new DOMSource(doc), new StreamResult(new File("someFilePath")));
Now you'll be able to see whether you've deleted it correctly!
EDIT:
If you want to pretty-print your file, you'll have to add these lines before transforming:
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
This will create some nice indentation and make your file much more readable!
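Putting the points from the answers above together (copy the live NodeList before removing anything, remove only the matching groupuser, then write the document back), a corrected version might look like the sketch below. The class name GroupUserRemover and the split into two methods are choices made for this example, not part of the original code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class for illustration.
final class GroupUserRemover {
    /** Removes every <groupuser> whose text equals username; returns how many were removed. */
    static int removeGroupUser(Document doc, String username) {
        NodeList live = doc.getElementsByTagName("groupuser");
        // Copy first: the NodeList is live and shrinks while nodes are being removed.
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            snapshot.add(live.item(i));
        }
        int removed = 0;
        for (Node user : snapshot) {
            if (username.equals(user.getTextContent())) {
                // Remove only the <groupuser>, not its parent <group>.
                user.getParentNode().removeChild(user);
                removed++;
            }
        }
        return removed;
    }

    /** Writes the modified document to the given file; without this the file never changes. */
    static void save(Document doc, File target) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(target));
    }
}
```

A call such as removeGroupUser(doc, "John") followed by save(doc, new File(filePath)) would then leave the <username>John</username> entry untouched, since only groupuser elements are examined.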

Edit a XML file in java

I have the following XML file:
<Tables>
<table>
<row></row>
</table>
</Tables>
and I want to edit it to :
<Tables>
<table>
<row>some value</row>
</table>
</Tables>
I write the XML file using a FileWriter. How can I edit it?
What I found is that I can create a temp file containing the edits, then delete the original file and rename the temp file. Is there any other way?
This is my code to write the file:
public boolean createTable(String path, String name, String[] properties) throws IOException {
FileWriter writer = new FileWriter(path);
writer.write("<Tables>");
writer.write("\t<" + name + ">");
for(int i=0; i<properties.length; i++){
writer.write("\t\t<" + properties[0] + "></" + properties[0] + ">");
}
writer.write("\t</" + name + ">");
writer.write("</Tables>");
writer.close();
return false;
}
Don't read and write XML yourself. Java comes with multiple APIs for parsing and generating XML, which take care of all the encoding and escaping issues for you:
DOM: XML is loaded into memory in a tree structure.
SAX: XML is processed as a sequence of events. This is a push parser, where the parser calls your code for each event.
StAX: XML is read as a sequence of events/tokens. This is a pull parser, where your code calls the parser to get the next value.
You can also find many third-party libraries for parsing XML, and Java itself also supports marshalling of XML to POJOs.
In your case I'd suggest DOM, since it's the easiest to use. Don't use DOM for huge XML files, since it loads the entire file into memory. For huge files, I'd suggest StAX.
Besides handling encoding issues, an XML parser makes the code less susceptible to minor variations in the input. For example, the 3 empty row elements below all mean the same thing. A parser also lets you check whether the row element is actually empty, and get rid of existing content like the kinds shown:
<!-- row is empty -->
<row></row>
<row/>
<row />
<!-- row has content -->
<row>5 + 7 &lt; 10</row>
<row><![CDATA[5 + 7 < 10]]></row>
<row><condition expr="5 + 7 &lt; 10"/></row>
Using DOM:
// Load XML from file
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
Document document = domBuilder.parse(file);
// Modify DOM tree (simple version)
NodeList rowNodes = document.getElementsByTagName("row");
for (int i = 0; i < rowNodes.getLength(); i++) {
Node rowNode = rowNodes.item(i);
// Remove existing content (if any)
while (rowNode.getFirstChild() != null)
rowNode.removeChild(rowNode.getFirstChild());
// Add text content
rowNode.appendChild(document.createTextNode("some value"));
}
// Save XML to file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new DOMSource(document),
new StreamResult(file));
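For the huge-file case, where StAX is suggested above instead of DOM, a pull-parser read could look roughly like this. It is only a sketch: the class name StaxRowReader is invented for the example, and it assumes each row element contains text only, because getElementText() throws if it meets a child element.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects the text of every <row> element without building a tree.
final class StaxRowReader {
    static List<String> readRows(Reader in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> rows = new ArrayList<>();
        while (reader.hasNext()) {
            // Your code pulls the next event from the parser.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "row".equals(reader.getLocalName())) {
                // Reads up to the matching end tag; entities and CDATA are already decoded.
                rows.add(reader.getElementText());
            }
        }
        reader.close();
        return rows;
    }
}
```

Here <row></row>, <row/> and <row /> all yield an empty string, which is the kind of input variation a hand-written reader tends to get wrong.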
If your XML is static you can use this; here input.xml is your XML file:
File file = new File("input.xml");
byte[] data;
try (FileInputStream fis = new FileInputStream(file)) {
data = new byte[(int) file.length()];
fis.read(data);
}
String input = new String(data, "UTF-8");
String tag = "<row>";
String newXML = input.substring(0, input.indexOf(tag) + tag.length()) + "your value" + input.substring(input.indexOf(tag) + tag.length(), input.length());
try (FileWriter fw = new FileWriter(file)) {
fw.write(newXML);
}
System.out.println("XML Updated");

Java/Android: Parsing XML to get all the xml in a certain tag

I'm programming in java (and ultimately in Android) and I have a set up like this
<A>
<B>
<C>stuff</C>
<D>
<E>other stuff</E>
<F>more stuff</F>
</D>
</B>
<B>
<C>stuff</C>
</B>
<B>
<C>some stuff</C>
<D>
<E>basic stuff</E>
<F>even more stuff</F>
</D>
</B>
</A>
I want to parse it so that we get (amongst other things which I've already coded) all the things in both D's so we'd get Strings that look like
<E>other stuff</E>
<F>more stuff</F>
an empty string ("") and
<E>basic stuff</E>
<F>even more stuff</F>
The parser I've been using stops as soon as it hits a less-than symbol '<', so it's been giving me nothing. Is there a way to parse it the way I described in Java?
EDIT: I just converted it to a String and used regular expressions.
You need to use a parser that's already written.
Don't use one that you've rolled yourself, you're just asking to make a problem for yourself.
To turn parsed XML back into a string, you can use a javax.xml.transform.Transformer. I've attached code which parses your example XML and prints all D elements to the console - I think you'll be able to turn this into what you want :)
// The below is simply to create a document to test the code with
String xml = "<A><B><C>stuff</C><D><E>other stuff</E><F>more stuff</F></D></B><B><C>stuff</C></B><B><C>some stuff</C><D><E>basic stuff</E><F>even more stuff</F></D></B></A>";
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputSource docSource = new InputSource(new StringReader(xml));
Document document = documentBuilder.parse(docSource);
// The above is simply to create a document to test the code with
// Transformer takes a DOMSource pointed at a Node and outputs it as text
Transformer transformer = TransformerFactory.newInstance().newTransformer();
// Add new lines for every element
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
// Skip the <? xml ... ?> prolog
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
NodeList elements = document.getElementsByTagName("D");
StringWriter sw = new StringWriter();
StreamResult res = new StreamResult(sw);
DOMSource source = new DOMSource();
for (int i = 0; i < elements.getLength(); i++) {
Element element = (Element) elements.item(i);
source.setNode(element);
// Write the current element to the stringwriter via the streamresult
transformer.transform(source, res);
}
System.out.println(sw.toString());
If you only want the contents of the elements, you can replace the for loop like this:
for (int i = 0; i < elements.getLength(); i++) {
Element element = (Element) elements.item(i);
NodeList childNodes = element.getChildNodes();
for (int j = 0; j < childNodes.getLength(); j++) {
Node childNode = childNodes.item(j);
source.setNode(childNode);
transformer.transform(source, res);
}
}

Xml document to DOM object using DocumentBuilderFactory

I am currently modifying a piece of code and I am wondering if the way the XML is formatted (tabs and spacing) will affect the way in which it is parsed into the DocumentBuilderFactory class.
In essence the question is...can I pass a big long string with no spacing into the DocumentBuilderFactory or does it need to be formatted in some way?
Thanks in advance. Included below is the class definition from Oracle's website.
Class DocumentBuilderFactory
"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. "
The documents will be different. Tabs and new lines will be converted into text nodes. You can eliminate these using the following method on DocumentBuilderFactory:
http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setIgnoringElementContentWhitespace(boolean)
But in order for it to work you must set up your DOM parser to validate the content against a DTD or xml schema.
Alternatively you could programmatically remove the extra whitespace yourself using something like the following:
public static void removeEmptyTextNodes(Node node) {
NodeList nodeList = node.getChildNodes();
Node childNode;
for (int x = nodeList.getLength() - 1; x >= 0; x--) {
childNode = nodeList.item(x);
if (childNode.getNodeType() == Node.TEXT_NODE) {
if (childNode.getNodeValue().trim().equals("")) {
node.removeChild(childNode);
}
} else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
removeEmptyTextNodes(childNode);
}
}
}
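For the validating route mentioned above, the following sketch shows setIgnoringElementContentWhitespace together with a DTD. The sample document, its internal DTD and the class name IgnorableWhitespaceDemo are all made up for illustration. The parser can only tell that the whitespace is ignorable because the DTD declares root as having element-only content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class.
final class IgnorableWhitespaceDemo {
    // Sample document with an internal DTD; <root> may contain only elements.
    static final String XML =
        "<!DOCTYPE root [\n"
      + "  <!ELEMENT root (A, B)>\n"
      + "  <!ELEMENT A EMPTY>\n"
      + "  <!ELEMENT B (#PCDATA)>\n"
      + "]>\n"
      + "<root>\n  <A/>\n  <B>tagB</B>\n</root>";

    static int countRootChildren(boolean ignoreWhitespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Validation is switched on together with the whitespace setting,
        // since the setting relies on the content model from the DTD.
        factory.setValidating(ignoreWhitespace);
        factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }
}
```

Without the setting the root has five children (three whitespace text nodes plus A and B); with validation and the setting enabled, only A and B should remain.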
It should not affect the ability of the parser as long as the string is valid XML. Tabs and newlines are stripped out or ignored by parsers and are really for the aesthetics of the human reader.
Note you will have to pass in an input stream (StringBufferInputStream for example) to the DocumentBuilder as the string version of parse assumes it is a URI to the XML.
The DocumentBuilder builds different DOM objects for xml string with line feeds and xml string without line feeds. Here is the code I tested:
String newlineChar = "\n"; // assumed definition; the original snippet does not show it
StringBuilder sb = new StringBuilder();
sb.append("<root>").append(newlineChar).append("<A>").append("</A>").append(newlineChar).append("<B>tagB").append("</B>").append("</root>");
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputStream xmlInput = new ByteArrayInputStream(sb.toString().getBytes());
Element documentRoot = builder.parse(xmlInput).getDocumentElement();
NodeList nodes = documentRoot.getChildNodes();
System.out.println("How many children does the root have? => " + nodes.getLength());
for(int index = 0; index < nodes.getLength(); index++){
System.out.println(nodes.item(index).getLocalName());
}
Output:
How many children does the root have? => 4
null
A
null
B
But if newlineChar is removed from the StringBuilder, the output is:
How many children does the root have? => 2
A
B
This demonstrates that the DOM objects generated by DocumentBuilder are different.
The format of the XML string shouldn't have any effect, but I remember a strange problem when I passed a long string to an XML parser: the parser was unable to parse an XML file because it was written as one long line.
It may be safer to insert line breaks so that no line is longer than, let's say, 1000 bytes.
Sadly, I remember neither why that error occurred nor which parser I used.
