I have an XML file about 400 MB
I need to find a specific element and then reformat its date attribute from mm-dd-yyyy to dd-mm-yyyy
Here is the code that I am using
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(inputXML);
doc.getDocumentElement().normalize();
//format the date
NodeList nodes = doc.getElementsByTagName("empDetails");
for (int i = 0; i < nodes.getLength(); i++){
String oldDate =nodes.item(i).getAttributes().getNamedItem("doj").getNodeValue();
String newValue = //formatted to dd-mm-yyyy
nodes.item(i).getAttributes().getNamedItem("doj").setTextContent(newValue);
}
//now write back to file
// write the content into xml file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer;
transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(new File(fileName));
transformer.transform(source, result);
However this is throwing out of memory
On windows 32 bit - it fails
So I tried this on a unix box and set the memory to :
java -Xmx3072m -classpath . MyTest
It did run for some time but failed again
Question - is it possible to be handling a file of 400 MB where I want to selectivey update and save? ( am sure the answer is yes )
Is my code bad - anything that I should change ? ( no unix shell scripts as an alternate solution please - my intent is to use java )
should I be bumping up the heap size further ?
Thanks,
satish
It would probably be better to use the StAX api read the document like a stream while writing out (again using StAX) the parts you don't want to change immediately to a a temporary file. When you get to a part you are interested in, change the values before feeding it back to the temporary file. When you are done you can rename the temporary file over the old one.
I'd recommend the XMLEventReader and XMLEventWriter. XMLEvents you don't care about you can pass directly through from reader to writer. This will only keep small parts of the document you are working on in memory.
XMLEventReader reader = ...;
XMLEventWriter writer = ...;
XMLEvent cursor;
while(reader.hasNext()){
cursor = reader.nextEvent();
if(doICareAboutThisEvent(cursor)){
writer.add(changeEvent(cursor));
}else{
writer.add(cursor);
}
}
Obviously the implementation can be more complicated and your decisions about which elements to care about and edit can be more complicated than the state of a single element. This is just a very simple example.
Related
i'm trying to write the header for an xml file so it would be something like this:
<file xmlns="http://my_namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://my_namespace file.xsd">
however, I can't seem to find how to do it using the Document class in java. This is what I have:
public void exportToXML() {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder;
try {
dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.newDocument();
doc.setXmlStandalone(true);
doc.createTextNode("<file xmlns=\"http://my_namespace"\n" +
"xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n" +
"xsi:schemaLocation=\"http://my_namespace file.xsd\">");
Element mainRootElement = doc.createElement("MainRootElement");
doc.appendChild(mainRootElement);
for(int i = 0; i < tipoDadosParaExportar.length; i++) {
mainRootElement.appendChild(criarFilhos(doc, tipoDadosParaExportar[i]));
}
Transformer tr = TransformerFactory.newInstance().newTransformer();
tr.transform(new DOMSource(doc),
new StreamResult(new FileOutputStream(filename)));
} catch (Exception e) {
e.printStackTrace();
}
}
I tried writing it on the file using the createTextNode but it didn't work either, it only writes the version before showing the elements.
PrintStartXMLFile
Would appreciate if you could help me. Have a nice day
Your createTextNode() method is only suitable for creating text nodes, it's not suitable for creating elements. You need to use createElement() for this. If you're doing this by building a tree, then you need to build nodes, you can't write lexical markup.
I'm not sure what MainRootElement is supposed to be; you've only given a fragment of your desired output so it's hard to tell.
Creating a DOM tree and then serializing it is a pretty laborious way of constructing an XML file. Using something like an XMLEventWriter is easier. But to be honest, I got frustrated by all the existing approaches and wrote a new library for the purpose as part of Saxon 10. It's called simply "Push", and looks something like this:
Processor proc = new Processor();
Serializer serializer = proc.newSerializer(new File(fileName));
Push push = proc.newPush(serializer);
Document doc = push.document(true);
doc.setDefaultNamespace("http://my_namespace");
Element root = doc.element("root")
.attribute(new QName("xsi", "http://www.w3.org/2001/XMLSchema-instance", "schemaLocation"),
"http://my_namespace file.xsd");
doc.close();
For example:
If I pass tagValue=1 then it should return complete xml1 as a string to me String input is in some function.
String input = "<1><xml1></xml1></1><2><xml2></xml2><2>.......<10000><xml10000></xml10000></10000>";
String output = "<xml1></xml1>"; // for tagValue=1;
If the xml has a root element then it is doable
1. Parse the XML using dom parser.
2. Iterate through each node
3. Find the desired node.
4. Write the node in a different xml using transform
Sample code
Step 1: I used XML as a string, you can read from file.
DocumentBuilder dBuilder = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(uri));
Document doc = dBuilder.parse(is);
Iterate through each node
if (doc.hasChildNodes()) {
printNote(doc.getChildNodes(), doc);
}
Please put in your logic to iterate thorugh the nodes and find the right child node which you want to process.
Write back as xml. Here assumption is that tempNode is the one you want to write as XML.
TransformerFactory tFactory =
TransformerFactory.newInstance();
Transformer transformer;
try {
transformer = tFactory.newTransformer();
DOMSource source = new DOMSource(tempNode);
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
You are not parsing XML, you are parsing a String, a replace should be enough in your particular case
String input = "<xml1><note><to>Tove</to><from>Jani</from<heading>Reminder</heading><body>Don't forget me this weekend</body></note><xml1>";
input = input.replace("<xml1>", "");
Consider the code fragment that I have at the moment which works and the right elements are found and placed into my map:
public void importXml(InputSource emailAttach)throws Exception {
Map<String, String> hWL = new HashMap<String, String>();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(emailAttach);
FileOutputStream fos=new FileOutputStream("temp.xml");
OutputStreamWriter os = new OutputStreamWriter(fos,"UTF-8");
// Transform to XML UTF-8 format
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.transform(new DOMSource(doc), new StreamResult(os));
os.close();
fos.close();
doc = db.parse(new File("temp.xml"));
NodeList nl = doc.getElementsByTagName("Email");
Element eE=(Element)nl.item(0);
int ctr=eE.getChildNodes().getLength();
String sNName;
String sNValue;
Node nTemp;
for (int i=0;i<ctr;i++){
nTemp=eE.getChildNodes().item(i);
sNName=nTemp.getNodeName().toUpperCase().trim();
if (nTemp.getChildNodes().item(0)!=null) {
sNValue=nTemp.getChildNodes().item(0).getNodeValue().trim();
hWL.put(sNName,sNValue);
}
}
}
However I prefer not to create a temp file first after converting the data to UTF-8 and parsing from the temp file. Is there anyway I can do this?
I've tried using a ByteArrayOutputStream in place of OutputStreamWriter, and calling toString() on the ByteArrayOutputStream as such:
doc = db.parse(bos.toString("UTF-8");
But then my Map ends up being empty.
From the API docs (the ability of its meticulous studying is a valuable asset for any programmer) - the parse method with the String argument seems to take something different from what you feed to it:
Document parse(String uri)
Parse the content of the given URI as an XML document and return a new DOM >Document object.
This might be your friend:
db.parse ( new ByteArrayInputStream( bos.toByteArray()));
Update
#user2496748 sorry I should have searched for the API but instead I was looking at the source code through a decompiler which tells me the parameter is arg0 instead of uri. Big difference.
I think I understand stream readers/writers and byte to char or vice versa a little more now.
After some review I was able to simply my code to this and achieve what I wanted to do. Since I am able to get the email attachment as a InputSource:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
emailAttach.setEncoding("UTF-8");
Document doc = db.parse(emailAttach);
Works as well and tested with non-english characters.
You don't need to write and re-read and re-parse the transformed document. Just change this:
t.transform(new DOMSource(doc), new StreamResult(os));
to this:
DOMResult result = new DOMResult();
t.transform(new DOMSource(doc), result);
doc = (Document)result.getNode();
and then continue from after your present doc = db.parse(new File("temp.xml"));.
While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:
String s = "This — That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
System.out.println(stringWriter.toString()); // outputs "This — That" at the relevant Node.
I have no control over the input string and I need exactly the output "This — That".
If I use StringEscapeUtils.unescapeHtml, the output is "This — That" which is not what I need.
I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "—".
What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text or how can I transform the input to get entities in the output?
Please explain how this is a duplicate.
The question referenced had the problem that "
" was being converted into CRLF because the entities were being resolved. The solution was to escape the entities.
My problem is the reverse. The text is already escaped and the transformer is re-escaping the text. "—" is outputting "—".
I cannot use the solution to post-convert all "&" -> "&" because not all nodes represent html.
More complete code:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This — That");
document.appendChild(rootElement);
DOMImplementation domImpl = bgDoc.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
"-//Company//program//language",
"test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());
// outputs xml header, then "<Test>This — That</Test>"
The fact is, once you have a DOM tree, there's no longer a string with —: it's instead represented internally as a Unicode string.
So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
Regarding serialization, there are a few other questions including Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat .
To parse a single node, there is LSParser.parseWithContext.
This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.
Have the following test code:
// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += " <ccc/></aaa>";
// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
// Normalize document
// Do I realy need to do this?
doc.normalize();
// Canonize using Apache's Xml security
org.apache.xml.security.Init.init(); // Doesnt work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
.canonicalizeSubtree(doc.getDocumentElement());
// This was a reparse reccomended to get attributes in alpha order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));
// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());
// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
"{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);
// And print it
System.out.println(xmlOutput.getWriter().toString());
Executing this code, would output:
<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
<ccc/>
</aaa>
Which might be canonized, but doesn't seem to respect the indentation I asked the transformer to do.
Having such an example, I have a few questions:
For my intent, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again within my intent of have a canonical and pretty printed xml).
Why do the blank spaces within the xml seem to screw the formatting? Would I have to trim the text of each xml node to make it work? It just sounds wrong, nonetheless if the input xml is <aaa Batt = \"That\" Aatt=\"this\" ><!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa> the xml is perfectly formatted.
Why after asking for the canonical form, tags such as <ccc/> weren't expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.