Import and parse an xml file without FileOutputStream - java

Consider the code fragment that I have at the moment which works and the right elements are found and placed into my map:
public void importXml(InputSource emailAttach)throws Exception {
Map<String, String> hWL = new HashMap<String, String>();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(emailAttach);
FileOutputStream fos=new FileOutputStream("temp.xml");
OutputStreamWriter os = new OutputStreamWriter(fos,"UTF-8");
// Transform to XML UTF-8 format
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.transform(new DOMSource(doc), new StreamResult(os));
os.close();
fos.close();
doc = db.parse(new File("temp.xml"));
NodeList nl = doc.getElementsByTagName("Email");
Element eE=(Element)nl.item(0);
int ctr=eE.getChildNodes().getLength();
String sNName;
String sNValue;
Node nTemp;
for (int i=0;i<ctr;i++){
nTemp=eE.getChildNodes().item(i);
sNName=nTemp.getNodeName().toUpperCase().trim();
if (nTemp.getChildNodes().item(0)!=null) {
sNValue=nTemp.getChildNodes().item(0).getNodeValue().trim();
hWL.put(sNName,sNValue);
}
}
}
However I prefer not to create a temp file first after converting the data to UTF-8 and parsing from the temp file. Is there anyway I can do this?
I've tried using a ByteArrayOutputStream in place of OutputStreamWriter, and calling toString() on the ByteArrayOutputStream as such:
doc = db.parse(bos.toString("UTF-8");
But then my Map ends up being empty.

From the API docs (the ability of its meticulous studying is a valuable asset for any programmer) - the parse method with the String argument seems to take something different from what you feed to it:
Document parse(String uri)
Parse the content of the given URI as an XML document and return a new DOM >Document object.
This might be your friend:
db.parse ( new ByteArrayInputStream( bos.toByteArray()));

Update
#user2496748 sorry I should have searched for the API but instead I was looking at the source code through a decompiler which tells me the parameter is arg0 instead of uri. Big difference.
I think I understand stream readers/writers and byte to char or vice versa a little more now.
After some review I was able to simply my code to this and achieve what I wanted to do. Since I am able to get the email attachment as a InputSource:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
emailAttach.setEncoding("UTF-8");
Document doc = db.parse(emailAttach);
Works as well and tested with non-english characters.

You don't need to write and re-read and re-parse the transformed document. Just change this:
t.transform(new DOMSource(doc), new StreamResult(os));
to this:
DOMResult result = new DOMResult();
t.transform(new DOMSource(doc), result);
doc = (Document)result.getNode();
and then continue from after your present doc = db.parse(new File("temp.xml"));.

Related

How to Parse Complete XML into a string using DOM PARSER?

For example:
If I pass tagValue=1 then it should return complete xml1 as a string to me String input is in some function.
String input = "<1><xml1></xml1></1><2><xml2></xml2><2>.......<10000><xml10000></xml10000></10000‌​>";
String output = "<xml1></xml1>"; // for tagValue=1;
If the xml has a root element then it is doable
1. Parse the XML using dom parser.
2. Iterate through each node
3. Find the desired node.
4. Write the node in a different xml using transform
Sample code
Step 1: I used XML as a string, you can read from file.
DocumentBuilder dBuilder = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(uri));
Document doc = dBuilder.parse(is);
Iterate through each node
if (doc.hasChildNodes()) {
printNote(doc.getChildNodes(), doc);
}
Please put in your logic to iterate thorugh the nodes and find the right child node which you want to process.
Write back as xml. Here assumption is that tempNode is the one you want to write as XML.
TransformerFactory tFactory =
TransformerFactory.newInstance();
Transformer transformer;
try {
transformer = tFactory.newTransformer();
DOMSource source = new DOMSource(tempNode);
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
You are not parsing XML, you are parsing a String, a replace should be enough in your particular case
String input = "<xml1><note><to>Tove</to><from>Jani</from<heading>Reminder</heading><body>Don't forget me this weekend</body></note><xml1>";
input = input.replace("<xml1>", "");

Write Document to Internal Storage

I've used Java DOM to edit an XML template and now want to store the resulting Document in the android private internal storage, as per the official API guides.
So far I have:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
doc = dBuilder.parse(ThisApplication.resources().openRawResource(R.raw.default_store));
// Populate document here.
//Convert document to byte[]
Source source = new DOMSource(doc);
ByteArrayOutputStream out = new ByteArrayOutputStream();
StreamResult result = new StreamResult(out);
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer();
transformer.transform(source, result); /// transformer is null!!!!!!!
// Store byte[] in internal storage
FileOutputStream fos = openFileOutput("data_store", Context.MODE_PRIVATE);
fos.write(out.toByteArray());
fos.close();
At the moment, I am getting a null pointer exception trying to call tranform() on transformer. The TransformerFactory API says that newTransformer(); can never return null, but apparently it's also platform dependent and in my case is returning null.
So, the question is how else can I either;
A) convert a Document object into a byte[] or
B) find another way to save a document to internal storage?
Edit: Android bug report filed.
As JDOM Document is Serializable, you can just write a Serializable object
I would do it something like:
ObjectOutputStream out = new ObjectOutputStream(fos);
out.writeObject(document);
out.close();

reading and updating a large xml file in java

I have an XML file about 400 MB
I need to find a specific element and then reformat its date attribute from mm-dd-yyyy to dd-mm-yyyy
Here is the code that I am using
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(inputXML);
doc.getDocumentElement().normalize();
//format the date
NodeList nodes = doc.getElementsByTagName("empDetails");
for (int i = 0; i < nodes.getLength(); i++){
String oldDate =nodes.item(i).getAttributes().getNamedItem("doj").getNodeValue();
String newValue = //formatted to dd-mm-yyyy
nodes.item(i).getAttributes().getNamedItem("doj").setTextContent(newValue);
}
//now write back to file
// write the content into xml file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer;
transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(new File(fileName));
transformer.transform(source, result);
However this is throwing out of memory
On windows 32 bit - it fails
So I tried this on a unix box and set the memory to :
java -Xmx3072m -classpath . MyTest
It did run for some time but failed again
Question - is it possible to be handling a file of 400 MB where I want to selectivey update and save? ( am sure the answer is yes )
Is my code bad - anything that I should change ? ( no unix shell scripts as an alternate solution please - my intent is to use java )
should I be bumping up the heap size further ?
Thanks,
satish
It would probably be better to use the StAX api read the document like a stream while writing out (again using StAX) the parts you don't want to change immediately to a a temporary file. When you get to a part you are interested in, change the values before feeding it back to the temporary file. When you are done you can rename the temporary file over the old one.
I'd recommend the XMLEventReader and XMLEventWriter. XMLEvents you don't care about you can pass directly through from reader to writer. This will only keep small parts of the document you are working on in memory.
XMLEventReader reader = ...;
XMLEventWriter writer = ...;
XMLEvent cursor;
while(reader.hasNext()){
cursor = reader.nextEvent();
if(doICareAboutThisEvent(cursor)){
writer.add(changeEvent(cursor));
}else{
writer.add(cursor);
}
}
Obviously the implementation can be more complicated and your decisions about which elements to care about and edit can be more complicated than the state of a single element. This is just a very simple example.

Bad Characters when parsing GML in Java

I'm using the org.w3c.dom package to parse the gml schemas (http://schemas.opengis.net/gml/3.1.0/base/).
When I parse the gmlBase.xsd schema and then save it back out, the quote characters around GeometryCollections in the BagType complex type come out converted to bad characters (See code below).
Is there something wrong with how I'm parsing or saving the xml, or is there something in the schema that is off?
Thanks,
Curtis
public static void main(String[] args) throws IOException
{
File schemaFile = File.createTempFile("gml_", ".xsd");
FileUtils.writeStringToFile(schemaFile, getSchema(new URL("http://schemas.opengis.net/gml/3.1.0/base/gmlBase.xsd")));
System.out.println("wrote file: " + schemaFile.getAbsolutePath());
}
public static String getSchema(URL schemaURL)
{
try
{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(IOUtils.toString(schemaURL.openStream()))));
Element rootElem = doc.getDocumentElement();
rootElem.normalize();
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
DOMSource source = new DOMSource(doc);
ByteArrayOutputStream xmlOutStream = new ByteArrayOutputStream();
StreamResult result = new StreamResult(xmlOutStream);
transformer.transform(source, result);
return xmlOutStream.toString();
}
catch (Exception e)
{
e.printStackTrace();
}
return "";
}
I'm suspicious of this line:
Document doc = db.parse(new InputSource(
new StringReader(IOUtils.toString(schemaURL.openStream()))));
I don't know what IOUtils.toString does here but presumably it's assuming a particular encoding, without taking account of the XML declaration.
Why not just use:
Document doc = db.parse(schemaURL.openStream());
Likewise your FileUtils.writeStringToFile doesn't appear to specify a character encoding... which encoding does it use, and why encoding is in the StreamResult?

Java, XML DocumentBuilder - setting the encoding when parsing

I'm trying to save a tree (extends JTree) which holds an XML document to a DOM Object having changed it's structure.
I have created a new document object, traversed the tree to retrieve the contents successfully (including the original encoding of the XML document), and now have a ByteArrayInputStream which has the tree contents (XML document) with the correct encoding.
The problem is when I parse the ByteArrayInputStream the encoding is changed to UTF-8 (in the XML document) automatically.
Is there a way to prevent this and use the correct encoding as provided in the ByteArrayInputStream.
It's also worth adding that I have already used the
transformer.setOutputProperty(OutputKeys.ENCODING, encoding) method to retrieve the right encoding.
Any help would be appreciated.
Here's an updated answer since OutputFormat is deprecated :
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(document), new StreamResult(writer));
String output = writer.getBuffer().toString().replaceAll("\n|\r", "");
The second part will return the XML Document as String
// Read XML
String xml = "xml"
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(xml)));
// Append formatting
OutputFormat format = new OutputFormat(document);
if (document.getXmlEncoding() != null) {
format.setEncoding(document.getXmlEncoding());
}
format.setLineWidth(100);
format.setIndenting(true);
format.setIndent(5);
Writer out = new StringWriter();
XMLSerializer serializer = new XMLSerializer(out, format);
serializer.serialize(document);
String result = out.toString();
I solved it, given alot of trial and errors.
I was using
OutputFormat format = new OutputFormat(document);
but changed it to
OutputFormat format = new OutputFormat(d, encoding, true);
and this solved my problem.
encoding is what I set it to be
true refers to whether or not indent is set.
Note to self - read more carefully - I had looked at the javadoc hours ago - if only I'd have read more carefully.
This worked for me and is very simple. No need for a transformer or output formatter:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(inputStream);
is.setEncoding("ISO-8859-1"); // set your encoding here
Document document = builder.parse(is);

Categories