Java, XML DocumentBuilder - setting the encoding when parsing

I'm trying to save a tree (extends JTree) which holds an XML document to a DOM object, having changed its structure.
I have created a new document object, traversed the tree to retrieve the contents successfully (including the original encoding of the XML document), and now have a ByteArrayInputStream which has the tree contents (XML document) with the correct encoding.
The problem is when I parse the ByteArrayInputStream the encoding is changed to UTF-8 (in the XML document) automatically.
Is there a way to prevent this and use the correct encoding as provided in the ByteArrayInputStream?
It's also worth adding that I have already used the transformer.setOutputProperty(OutputKeys.ENCODING, encoding) method to retrieve the right encoding.
Any help would be appreciated.

Here's an updated answer, since OutputFormat is deprecated:
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(document), new StreamResult(writer));
String output = writer.getBuffer().toString().replaceAll("\n|\r", "");
The second part will return the XML Document as String
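One caveat worth illustrating: a StringWriter holds characters rather than bytes, so with it the ENCODING property can only influence the XML declaration and character escaping. The following is a minimal sketch, assuming the goal is bytes that are really encoded in ISO-8859-1; the class name EncodingRoundTrip and the sample XML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EncodingRoundTrip {

    /** Serializes the document to bytes in the given encoding, e.g. "ISO-8859-1". */
    static byte[] toBytes(Document document, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        // A byte stream target lets the transformer do the character encoding itself.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Hypothetical sample document; \u00e9 is an e with an acute accent.
        Document document = builder.parse(
                new InputSource(new StringReader("<name>caf\u00e9</name>")));

        byte[] bytes = toBytes(document, "ISO-8859-1");

        // Re-parsing from the byte stream lets the parser pick up the declared encoding.
        Document reparsed = builder.parse(new ByteArrayInputStream(bytes));
        System.out.println(reparsed.getXmlEncoding());
        System.out.println(reparsed.getDocumentElement().getTextContent());
    }
}
```

Feeding the resulting bytes to the parser through a ByteArrayInputStream, rather than through a String, keeps the declaration and the actual bytes in agreement.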

// Read XML
String xml = "xml";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(xml)));
// Append formatting
OutputFormat format = new OutputFormat(document);
if (document.getXmlEncoding() != null) {
format.setEncoding(document.getXmlEncoding());
}
format.setLineWidth(100);
format.setIndenting(true);
format.setIndent(5);
Writer out = new StringWriter();
XMLSerializer serializer = new XMLSerializer(out, format);
serializer.serialize(document);
String result = out.toString();

I solved it, after a lot of trial and error.
I was using
OutputFormat format = new OutputFormat(document);
but changed it to
OutputFormat format = new OutputFormat(d, encoding, true);
and this solved my problem.
Here encoding is the encoding I want to use, and true indicates that indenting is enabled.
Note to self - read more carefully - I had looked at the javadoc hours ago - if only I'd have read more carefully.

This worked for me and is very simple. No need for a transformer or output formatter:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(inputStream);
is.setEncoding("ISO-8859-1"); // set your encoding here
Document document = builder.parse(is);
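For a self-contained picture of what setEncoding does, here is a small sketch. It assumes the input bytes are ISO-8859-1 and carry no XML declaration of their own; the class and method names (InputSourceEncoding, parseAs) are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InputSourceEncoding {

    /** Parses raw bytes, telling the parser which encoding to decode them with. */
    static Document parseAs(byte[] xmlBytes, String encoding) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource is = new InputSource(new ByteArrayInputStream(xmlBytes));
        is.setEncoding(encoding); // only affects how the input bytes are decoded
        return builder.parse(is);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: no XML declaration, bytes encoded as ISO-8859-1.
        byte[] bytes = "<name>caf\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        Document document = parseAs(bytes, "ISO-8859-1");
        System.out.println(document.getDocumentElement().getTextContent());
    }
}
```

Note that this controls decoding of the input only. The encoding used when the document is written out again is still chosen by whatever serializes it.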

Related

Import and parse an xml file without FileOutputStream

Consider the code fragment that I have at the moment which works and the right elements are found and placed into my map:
public void importXml(InputSource emailAttach)throws Exception {
Map<String, String> hWL = new HashMap<String, String>();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(emailAttach);
FileOutputStream fos=new FileOutputStream("temp.xml");
OutputStreamWriter os = new OutputStreamWriter(fos,"UTF-8");
// Transform to XML UTF-8 format
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.transform(new DOMSource(doc), new StreamResult(os));
os.close();
fos.close();
doc = db.parse(new File("temp.xml"));
NodeList nl = doc.getElementsByTagName("Email");
Element eE=(Element)nl.item(0);
int ctr=eE.getChildNodes().getLength();
String sNName;
String sNValue;
Node nTemp;
for (int i=0;i<ctr;i++){
nTemp=eE.getChildNodes().item(i);
sNName=nTemp.getNodeName().toUpperCase().trim();
if (nTemp.getChildNodes().item(0)!=null) {
sNValue=nTemp.getChildNodes().item(0).getNodeValue().trim();
hWL.put(sNName,sNValue);
}
}
}
However, I would prefer not to create a temp file after converting the data to UTF-8 and then parse from that temp file. Is there any way I can do this?
I've tried using a ByteArrayOutputStream in place of OutputStreamWriter, and calling toString() on the ByteArrayOutputStream as such:
doc = db.parse(bos.toString("UTF-8"));
But then my Map ends up being empty.
From the API docs (studying them carefully is a valuable habit for any programmer): the parse method that takes a String argument expects something different from what you are passing to it:
Document parse(String uri)
Parse the content of the given URI as an XML document and return a new DOM Document object.
This might be your friend:
db.parse ( new ByteArrayInputStream( bos.toByteArray()));
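Put together, the in-memory variant might look like the sketch below. It is only an illustration: the class name InMemoryReparse and the sample Email document are assumptions, not the asker's real data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InMemoryReparse {

    /** Writes the document as UTF-8 into memory and parses those bytes again. */
    static Document reparse(Document doc) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new DOMSource(doc), new StreamResult(bos));

        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse(InputStream) reads XML content; parse(String) would treat its argument as a URI.
        return db.parse(new ByteArrayInputStream(bos.toByteArray()));
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Hypothetical stand-in for the email attachment.
        Document doc = db.parse(new InputSource(
                new StringReader("<Email><From>a@example.com</From></Email>")));
        Document copy = reparse(doc);
        System.out.println(copy.getElementsByTagName("Email").getLength());
    }
}
```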
Update
@user2496748 Sorry, I should have checked the API docs; instead I was looking at the source code through a decompiler, which shows the parameter as arg0 instead of uri. Big difference.
I think I understand stream readers/writers and byte to char or vice versa a little more now.
After some review I was able to simplify my code to this and achieve what I wanted to do, since I am able to get the email attachment as an InputSource:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
emailAttach.setEncoding("UTF-8");
Document doc = db.parse(emailAttach);
This works as well, and I have tested it with non-English characters.
You don't need to write and re-read and re-parse the transformed document. Just change this:
t.transform(new DOMSource(doc), new StreamResult(os));
to this:
DOMResult result = new DOMResult();
t.transform(new DOMSource(doc), result);
doc = (Document)result.getNode();
and then continue from after your present doc = db.parse(new File("temp.xml"));.
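As a compact, runnable version of that suggestion, here is a sketch assuming a default (identity) Transformer; the class name DomResultCopy and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomResultCopy {

    /** Runs the transform DOM-to-DOM, so nothing is serialized or re-parsed. */
    static Document transformToDom(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        t.transform(new DOMSource(doc), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the email attachment.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                new InputSource(new StringReader("<Email><From>a@example.com</From></Email>")));
        Document result = transformToDom(doc);
        System.out.println(result.getDocumentElement().getNodeName());
    }
}
```

Because no bytes are produced at any point, character encoding does not come into play in this variant.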

Work with raw text in javax.xml.transform.Transformer

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:
String s = "This &mdash; That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
System.out.println(stringWriter.toString()); // outputs "This &amp;mdash; That" at the relevant Node.
I have no control over the input string and I need exactly the output "This &mdash; That".
If I use StringEscapeUtils.unescapeHtml, the output is "This — That" which is not what I need.
I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "&mdash;".
What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text or how can I transform the input to get entities in the output?
Please explain how this is a duplicate.
The question referenced had the problem that the character references for CR and LF were being converted into a literal CRLF because the entities were being resolved. The solution was to escape the entities.
My problem is the reverse. The text is already escaped and the transformer is re-escaping the text. "&mdash;" is outputting "&amp;mdash;".
I cannot use the solution to post-convert all "&amp;" -> "&" because not all nodes represent HTML.
More complete code:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This &mdash; That"));
document.appendChild(rootElement);
DOMImplementation domImpl = document.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
"-//Company//program//language",
"test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());
// outputs xml header, then "<Test>This &amp;mdash; That</Test>"
The fact is, once you have a DOM tree, there's no longer a string with &mdash;: it's instead represented internally as a Unicode string.
So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
Regarding serialization, there are a few other questions, including "Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat".
To parse a single node, there is LSParser.parseWithContext.
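To make the point about text nodes concrete, here is a rough sketch. It uses a numeric character reference (&#8212;) for the parsed case, because the named entity mdash is not defined in plain XML without a DTD; the US-ASCII serialization at the end is an extra illustration and not part of the answer above. All class and method names are invented.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class EntityEscapingDemo {

    /** Serializes without an XML declaration and decodes the bytes with the same encoding. */
    static String serialize(Document doc, String encoding) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString(encoding);
    }

    /** Builds a Test element whose content is added as a plain text node. */
    static Document withTextNode(String text) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = b.newDocument();
        Element root = doc.createElement("Test");
        root.appendChild(doc.createTextNode(text));
        doc.appendChild(root);
        return doc;
    }

    /** Parses markup, so character references are resolved to real characters. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return b.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // createTextNode treats its argument as literal text, so the ampersand is escaped on output.
        System.out.println(serialize(withTextNode("This &mdash; That"), "UTF-8"));

        // Parsing resolves the reference: the DOM holds the dash character (U+2014) itself.
        Document parsed = parse("<Test>This &#8212; That</Test>");
        System.out.println(parsed.getDocumentElement().getTextContent().indexOf('\u2014'));

        // With a US-ASCII target the dash cannot be written raw, so the serializer
        // has to fall back to a character reference.
        System.out.println(serialize(parsed, "US-ASCII"));
    }
}
```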

Write Document to Internal Storage

I've used Java DOM to edit an XML template and now want to store the resulting Document in the Android private internal storage, as per the official API guides.
So far I have:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
doc = dBuilder.parse(ThisApplication.resources().openRawResource(R.raw.default_store));
// Populate document here.
//Convert document to byte[]
Source source = new DOMSource(doc);
ByteArrayOutputStream out = new ByteArrayOutputStream();
StreamResult result = new StreamResult(out);
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer();
transformer.transform(source, result); /// transformer is null!!!!!!!
// Store byte[] in internal storage
FileOutputStream fos = openFileOutput("data_store", Context.MODE_PRIVATE);
fos.write(out.toByteArray());
fos.close();
At the moment, I am getting a NullPointerException when calling transform() on transformer. The TransformerFactory API says that newTransformer() can never return null, but apparently it's also platform dependent and in my case it is returning null.
So, the question is how else can I either:
A) convert a Document object into a byte[] or
B) find another way to save a document to internal storage?
Edit: Android bug report filed.
As the JDOM Document is Serializable, you can just write it out as a Serializable object.
I would do it something like this:
ObjectOutputStream out = new ObjectOutputStream(fos);
out.writeObject(document);
out.close();

Which is the best way to write a XML Document to a file in java?

I am trying to write an XML file. I was able to create the Document using the following code. I want to write this Document to a file with indent support. Currently my code looks like this.
Which is the better technology for parsing XML and writing it to a file?
public void writeXmlToFile(Document dom) throws IOException {
OutputFormat format = new OutputFormat(dom);
format.setIndenting(true);
XMLSerializer serializer = new XMLSerializer ( new FileOutputStream(
new File("sample.xml")), format);
serializer.serialize(dom);
}
Or is using a Transformer a better approach?
public void writeXMLToFile(Document dom) throws TransformerException, IOException {
TransformerFactory transFact = TransformerFactory.newInstance();
Transformer trans = transFact.newTransformer();
trans.setOutputProperty(OutputKeys.ENCODING, "utf-8");
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
trans.setOutputProperty(OutputKeys.INDENT, "yes");
trans.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
// 'output' is the target file, assumed to be defined elsewhere
Writer writer = new FileWriter(output);
StreamResult result = new StreamResult(writer);
DOMSource source = new DOMSource(dom);
trans.transform(source, result);
writer.close();
}
What is the difference between the two approaches? And which of these techniques provide better performance?
To answer your question, I would suggest a third way: the DOM Load and Save API proposed by the W3C. The code is self-explanatory.
DOMImplementationLS ls = (DOMImplementationLS)
DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
// Gets a basic document from string.
LSInput input = ls.createLSInput();
String xml = "<bookstore city='shanghai'><a></a><b/></bookstore>";
InputStream istream = new ByteArrayInputStream(xml.getBytes("UTF-8"));
input.setByteStream(istream);
LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = parser.parse(input);
// Creates a LSSerializer object and saves to file.
LSSerializer serializer = ls.createLSSerializer();
serializer.getDomConfig().setParameter("format-pretty-print", true);
LSOutput output = ls.createLSOutput();
OutputStream ostream = new FileOutputStream("c:\\temp\\foo.xml");
output.setByteStream(ostream);
serializer.write(document, output);
Unlike XMLSerializer, which is more or less a pre-standard, this approach is preferred as it is supported by all compliant implementations. The performance largely depends on the vendor implementation, though.
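Since encoding comes up repeatedly with these APIs, here is a hedged sketch of the same Load and Save approach writing to an in-memory byte array with an explicit encoding set on the LSOutput. The class name LsToBytes and the sample document are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

public class LsToBytes {

    /** Serializes the document with DOM Load and Save, using the requested output encoding. */
    static byte[] toBytes(Document document, String encoding) throws Exception {
        DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        LSSerializer serializer = ls.createLSSerializer();
        LSOutput output = ls.createLSOutput();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        output.setByteStream(bytes);
        output.setEncoding(encoding); // requested encoding for the serialized bytes
        serializer.write(document, output);
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document; \u00e9 is an e with an acute accent.
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                new InputSource(new StringReader("<name>caf\u00e9</name>")));
        byte[] bytes = toBytes(document, "ISO-8859-1");
        System.out.println(bytes.length);
    }
}
```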

Bad Characters when parsing GML in Java

I'm using the org.w3c.dom package to parse the GML schemas (http://schemas.opengis.net/gml/3.1.0/base/).
When I parse the gmlBase.xsd schema and then save it back out, the quote characters around GeometryCollections in the BagType complex type come out converted to bad characters (see code below).
Is there something wrong with how I'm parsing or saving the xml, or is there something in the schema that is off?
Thanks,
Curtis
public static void main(String[] args) throws IOException
{
File schemaFile = File.createTempFile("gml_", ".xsd");
FileUtils.writeStringToFile(schemaFile, getSchema(new URL("http://schemas.opengis.net/gml/3.1.0/base/gmlBase.xsd")));
System.out.println("wrote file: " + schemaFile.getAbsolutePath());
}
public static String getSchema(URL schemaURL)
{
try
{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(IOUtils.toString(schemaURL.openStream()))));
Element rootElem = doc.getDocumentElement();
rootElem.normalize();
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
DOMSource source = new DOMSource(doc);
ByteArrayOutputStream xmlOutStream = new ByteArrayOutputStream();
StreamResult result = new StreamResult(xmlOutStream);
transformer.transform(source, result);
return xmlOutStream.toString();
}
catch (Exception e)
{
e.printStackTrace();
}
return "";
}
I'm suspicious of this line:
Document doc = db.parse(new InputSource(
new StringReader(IOUtils.toString(schemaURL.openStream()))));
I don't know what IOUtils.toString does here but presumably it's assuming a particular encoding, without taking account of the XML declaration.
Why not just use:
Document doc = db.parse(schemaURL.openStream());
Likewise, your FileUtils.writeStringToFile doesn't appear to specify a character encoding... which encoding does it use, and which encoding ends up in the StreamResult?
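The difference between the two parsing styles can be shown with a small experiment. This is only a sketch with made-up data (an ISO-8859-1 document containing one accented character) and invented names, not the GML schema itself.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class StreamVsString {

    /** Lets the parser decode the bytes, so the XML declaration is honoured. */
    static String textViaStream(byte[] xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new ByteArrayInputStream(xml)).getDocumentElement().getTextContent();
    }

    /** Decodes the bytes up front with an assumed charset, then parses the characters. */
    static String textViaString(byte[] xml, Charset assumed) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String decoded = new String(xml, assumed);
        return db.parse(new InputSource(new StringReader(decoded)))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document declared and encoded as ISO-8859-1.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(textViaStream(xml).equals("caf\u00e9"));
        // Decoding with the wrong charset damages the accented character before parsing starts.
        System.out.println(textViaString(xml, StandardCharsets.UTF_8).equals("caf\u00e9"));
    }
}
```

The same reasoning applies on the way out: converting the serialized bytes to a String and back to a file without naming a charset leaves the result at the mercy of the platform default.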
