Not able to parse XML file containing Chinese content - java

I have an XML file containing Chinese content. But while displaying I am getting question marks. Could somebody look into this issue?
My book.xml :
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<book>
<person>
<first>密码</first>
<last>Pai</last>
<age>22</age>
</person>
</book>
And my code is:
public static void main (String argv []){
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse (new File("book.xml"));
String strDoc=getStringFromDocument(doc);
System.out.println(strDoc);
}
public static String getStringFromDocument(Document doc) {
TransformerFactory transfac = TransformerFactory.newInstance();
Transformer trans = transfac.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
trans.setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter sw = new StringWriter();
StreamResult result = new StreamResult(sw);
DOMSource source = new DOMSource(doc);
trans.transform(source, result);
String xmlString = sw.toString();
return xmlString.toString();
}
After that I am getting ??:
<?xml version="1.0" encoding="UTF-8"?>
<book>
<person>
<first>??</first>
<last>Pai</last>
<age>22</age>
</person>

Your code runs fine on my system. I was able to create a books.xml with chinese characters, run your code on my system and get the correct output.
[update]
Previously I thought your books.xml file was suspect - but I was finally able to reproduce your problem on my system by setting -Dfile.encoding=ISO-8859-1.
Somewhere in your environment you have an incorrect character encoding setting. Perhaps in the JVM, perhaps in the console where you are displaying the characters.
One way to ensure that you are writing your String as a UTF-8 encoded byte stream is to change:
System.out.println(strDoc);
to
System.out.write(strDoc.getBytes("UTF-8"));
This may or may not fix what you are seeing on the screen. Your console must also be configured to properly handle UTF-8 encoded data. But if you write these bytes to a file or socket, you should be able confirm that the bytes match those in your original file.

Related

xquery transformation creates empty namespace in element

I'm sorry but I guess I just don't see the mistake I'm making here.
I have a camel route which returns an XML and to be able to test the output I wrote a JUnit Test which runs with SpringRunner. There I get the XML Stream from the exchange which I validate against an XSD. This works great because the XSD throws an exception because the output XML is not valid, but I don't understand why the following xquery generates an element with EMPTY NAMESPACE?
See the xquery snippet (I'm sorry again I cannot provide more code):
declare default element namespace "http://www.dppgroup.com/XXXPMS";
let $cmmdoc := $doc/*:cmmdoc
, $partner := $doc/*:cmmdoc/*:information/*:partner_gruppe/*:partner
, $sequence:= fn:substring($cmmdoc/#unifier,3)
return <ClientMMS xmlns:infra="http://www.dppgroup.com/InfraNS">
{
for $x in $partner
where $x[#partnerStatusCode = " "]
return
element {"DataGroup" } {
<Client sequenceNumber="{$sequence}" />
}
}
My problem is, that with this code the resulting XML contains the DataGroup-element with the following namespace definition:
<?xml version="1.0" encoding="UTF-8"?>
<ClientMMS xmlns="http://www.dppgroup.com/XXXPMS"
xmlns:infra="http://www.dppgroup.com/InfraNS">
<DataGroup xmlns="">
<Client sequenceNumber="170908065609671475"/>
</DataGroup>
</ClientMMS>
The snippet from the Unit-Test: I'm using jdk1.8_102
String xml = TestDataReader.readXML("/input/info/info_in.xml", PROJECT_ENCODING);
quelle.sendBody(xml);
boolean valid = false;
try {
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream((byte[]) archiv.getExchanges().get(1).getIn().getBody());
Document document = documentBuilder.parse(byteArrayInputStream);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(document);
transformer.transform(source, result);
String xmlString = result.getWriter().toString();
System.out.println(xmlString);
In no XQuery introduction/tutorial/explanation I can find a reason why this happens. Can you guys please explain why the DataGroup element is not in the default namespace?
The XQuery you posted should create the result fine without the namespace undeclaration you show.
In your Java code if you want to work with XML with namespaces make sure you use a namespace aware DocumentBuilder, as the default DocumentBuilderFactory is not namespace aware make sure you set setNamespaceAware(true) on the factory before creating a DocumentBuilder with it.

Replace text in XML before XSLT

I need to replace a certain text in a XML file before giving it to the XSL-Transformer.
It's the DTD-URL in the DOCTYPE tag. It points to a webserver, but I want it to be usable offline, so I want to change it to a URL pointing to a local file.
However I mustn't edit the original XML directly. I thought of reading the file into a string, use String.replaceAll() on the text and save it into another file, which I pass to the Transformer. I already tried it, but it's really slow; the file I'm using has a size of ca. 500kiB.
Is there any better (=faster) way to accomplish this?
EDIT: The code used for the transformation:
public String getPlaylist(String playlist) {
Source source = new StreamSource(library);
StreamSource xsl = new StreamSource(getClass().getResourceAsStream("M3Utransformation.xml"));
StringWriter w = new StringWriter();
Result result = new StreamResult(w);
try {
Transformer transformer = TransformerFactory.newInstance().newTransformer(xsl);
transformer.setParameter("playlist", playlist);
transformer.transform(source, result);
return w.getBuffer().toString();
} catch (Throwable t) {
t.printStackTrace();
return null;
}
}
You can create an entity resolver, and make use of it.
The following example uses the JAXP DocumentBuilder, and a CatalogResolver
public static void main(String[] args) throws ParserConfigurationException,
SAXException, IOException, TransformerConfigurationException, TransformerException {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(new CatalogResolver());
File src = new File("src/article.xml");
Document doc = db.parse(src);
// Here, we execute the transformation
// Use a Transformer for output
File stylesheet = new File("src/aticle.xsl");
TransformerFactory tFactory = TransformerFactory.newInstance();
StreamSource stylesource = new StreamSource(stylesheet);
Transformer transformer = tFactory.newTransformer(stylesource);
DOMSource source = new DOMSource(document);
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
}
create a catalog properties file, and place it on your classpath
CatalogManager.properties has to be the name, see CatalogManager API documentation
define a catalog XML file, point your properties file, above to it. From
http://www.xml.com/pub/a/2004/03/03/catalogs.html you can find a very simple catalog XML file :
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OYRM/foo" uri="src/bar.dtd"/>
</catalog>
With the above catalog.xml and CatalogManager.properties, you'll end up resolving references to the publicId "-//OYRM/foo" to the uri src/bar.dtd
xml-commons contains the resolver :
http://xerces.apache.org/mirrors.cgi#binary
for a more complete treatment of the topic of Resolvers read Tom White's article from XML.com
The transformer application was cribbed from the Java trail for Extensible StyleSheet Language Transformations > Transforming Data with XSLT

Output format of XML empty element in Java

I wrote code below to get XML output.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.newDocument();
Element element = document.createElement("Test");
Text text = document.createTextNode("");
element.appendChild(text);
document.appendChild(element);
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
DOMSource source = new DOMSource(document);
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
What I got is
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Test/>
What I want to get is
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Test></Test>
How can I do this?
Many thanks.
There is no clean way to do this..
If you feel comfortable to use duct-tape solutions, you could let your transformer output html instead of xml:
transformer.setOutputProperty(javax.xml.transform.OutputKeys.METHOD, "html");
But again, I have to point out that this is not a clean solution, but it did the trick for me as I was stuck with a similar problem

XML Canonical form in Java

This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.
Have the following test code:
// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += " <ccc/></aaa>";
// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
// Normalize document
// Do I realy need to do this?
doc.normalize();
// Canonize using Apache's Xml security
org.apache.xml.security.Init.init(); // Doesnt work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
.canonicalizeSubtree(doc.getDocumentElement());
// This was a reparse reccomended to get attributes in alpha order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));
// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());
// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
"{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);
// And print it
System.out.println(xmlOutput.getWriter().toString());
Executing this code, would output:
<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
<ccc/>
</aaa>
Which might be canonized, but doesn't seem to respect the indentation I asked the transformer to do.
Having such an example, I have a few questions:
For my intent, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again within my intent of have a canonical and pretty printed xml).
Why do the blank spaces within the xml seem to screw the formatting? Would I have to trim the text of each xml node to make it work? It just sounds wrong, nonetheless if the input xml is <aaa Batt = \"That\" Aatt=\"this\" ><!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa> the xml is perfectly formatted.
Why after asking for the canonical form, tags such as <ccc/> weren't expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.

How to disable/avoid Ampersand-Escaping in Java-XML?

I want to create a XML where blanks are replaced by  . But the Java-Transformer escapes the Ampersand, so that the output is &#160;
Here is my sample code:
public class Test {
public static void main(String[] args) {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.newDocument();
Element element = document.createElement("element");
element.setTextContent(" ");
document.appendChild(element);
ByteArrayOutputStream stream = new ByteArrayOutputStream();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StreamResult streamResult = new StreamResult(stream);
transformer.transform(new DOMSource(document), streamResult);
System.out.println(stream.toString());
}
}
And this is the output of my sample code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<element>&#160;</element>
Any ideas to fix or avoid that? thanks a lot!
Use output escaping as follows:
Node disableEscaping = document.createProcessingInstruction(StreamResult.PI_DISABLE_OUTPUT_ESCAPING, "&");
Element element = document.createElement("element");
element.setTextContent(" ");
document.appendChild(disableEscaping );
document.appendChild(element);
Node enableEscaping = document.createProcessingInstruction(StreamResult.PI_ENABLE_OUTPUT_ESCAPING, "&");
document.appendChild(enableEscaping )
Set the text content directly to the character you want, and the serializer will escape it for you if necessary:
element.setTextContent("\u00A0");
Try to use
element.appendChild (document.createCDATASection (" "));
instead of
element.setTextContent(...);
You'll get this in your xml:
It may work if I understand correctly what you're trying to do.
As addon to forty-two's answer:
If, like me, you're trying the code in a non-patched Eclipse IDE, you're likely to see some weird A's appearing instead of the non-breaking space. This is because of the encoding of the console in Eclipse not matching Unicode (UTF-8).
Adding -Dfile.encoding=UTF-8 to your eclipse.ini should solve this.
Cheers,
Wim

Categories