Java Transformer converts Chinese character to ASCII value [duplicate] - java

This question already has answers here:
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8") is NOT working
(8 answers)
Closed 3 years ago.
OK, after a lot of searching I decided to ask here. Below is sample code that reproduces my problem. The Document object is built with a Chinese character.
String value= "𧀠";
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("value");
root.setAttribute("attribute", value);
doc.appendChild(root);
DOMSource source = new DOMSource(doc);
I am trying to convert the document source to a string using the Transformer class with the code below.
ByteArrayOutputStream outStream = null;
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StreamResult htmlStreamResult = new StreamResult( new ByteArrayOutputStream() );
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(source, htmlStreamResult);
outStream = (ByteArrayOutputStream) htmlStreamResult.getOutputStream();
String outPut = outStream.toString( "UTF-8" );
But the output has the Chinese character converted to a numeric character reference, as below.
<?xml version="1.0" encoding="UTF-8" standalone="no"?><value attribute="&#159776;"/>
I do not want the Chinese character to be converted; it should be written as it is. I would appreciate any help with this.

Change UTF-8 to UTF-16. Since you're building a String (which is encoding-agnostic once decoded), this has no ill effect on the result. It does, however, change the encoding declaration in the XML header and sometimes adds a BOM (byte-order mark) to the byte stream. You can optionally leave the header out and attach your own.
String value= "𧀠かな〜"; // (I don't see your character so I added some of my own)
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("value");
root.setAttribute("attribute", value);
doc.appendChild(root);
DOMSource source = new DOMSource(doc);
ByteArrayOutputStream outStream = null;
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StreamResult htmlStreamResult = new StreamResult( new ByteArrayOutputStream() );
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
// transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes"); // optional
transformer.transform(source, htmlStreamResult);
outStream = (ByteArrayOutputStream) htmlStreamResult.getOutputStream();
String outPut = outStream.toString( "UTF-16" );
System.out.println(outPut);
Output:
<?xml version="1.0" encoding="UTF-16" standalone="no"?><value attribute="𧀠かな〜"/>
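As a side note, a numeric character reference and the literal character are equivalent to any XML parser, so the escaped output is still correct XML even if it is harder to read. A minimal sketch, assuming the reference in question is the decimal form of U+27020 (`&#159776;`); the class and method names are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefRoundTrip {
    // Parses a small XML string and returns the root element's "attribute" value.
    static String readAttribute(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("attribute");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // U+27020 is outside the BMP, so it is built from its code point here.
        String literal = new String(Character.toChars(0x27020));
        String fromReference = readAttribute("<value attribute=\"&#159776;\"/>");
        System.out.println(literal.equals(fromReference)); // prints "true"
    }
}
```

So the UTF-16 change matters mainly when a human or a non-XML tool reads the serialized text.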

Related

Converting emoji to HTML Decimal Code or Unicode Hexadecimal Code in java

I am trying to convert a text file with emoji content into a file with each emoji's HTML decimal code or hex code, using Java.
Example:
Input: <div id="thread" style="white-space: pre-wrap;"><div>😀😀😃🍎🍏⚽️🏀
Expected output: <div id="thread" style="white-space: pre-wrap;"><div>&#128512;&#128512;&#128515;&#127822;&#127823;&#9917;&#65039;&#127936;
In the output above, '😀' should be changed to the corresponding HTML entity code '&#128512;'.
Details of the HTML entity codes and hex codes are given here:
http://character-code.com/emoticons-html-codes.php
The sample code that I tried is below:
try {
File file = new File("/inFile.txt");
String str = FileUtils.readFileToString(file, "ISO-8859-1");
System.out.println(new String(str.getBytes(), "UTF-8"));
String results = StringEscapeUtils.escapeHtml4(str);
System.out.println(results);
} catch (IOException e) {
e.printStackTrace();
}
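I found a workaround: serialize with an output encoding that cannot represent emoji (ISO-8859-1 here), so the Transformer writes them as numeric character references: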
public static void htmlDecimalCodeGenerator () {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setValidating(false);
File inputFile = new File("/inputFile.xml");
try {
FileOutputStream fop = null;
File OutFile = new File("/outputFile.xml");
fop = new FileOutputStream(OutFile);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(inputFile);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
/*
no value of OMIT_XML_DECLARATION will add following xml declaration in the beginning of the file.
<?xml version='1.0' encoding='UTF-32'?>
*/
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
/*
When the output method is "xml", the version value specifies the
version of XML to be used for outputting the result tree. The default
value for the xml output method is 1.0. When the output method is
"html", the version value indicates the version of the HTML.
The default value for the html output method is 4.0, which specifies
that the result should be output as HTML conforming to the HTML 4.0
Recommendation [HTML]. If the output method is "text", the version
property is ignored
*/
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
/*
Indent-- specifies whether the Transformer may
add additional whitespace when outputting the result tree; the value
must be yes or no.
*/
transformer.setOutputProperty(OutputKeys.INDENT, "no");
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
// transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(new DOMSource(doc),
new StreamResult(new OutputStreamWriter(System.out, "UTF-8")));
// new StreamResult(new OutputStreamWriter(fop, "UTF-8")));
} catch (Exception e) {
e.printStackTrace();
}
}
}
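The workaround above needs well-formed XML as input. If the file is an HTML fragment or plain text, a code-point loop gives the same decimal references without a parser. This is only a sketch: `toDecimalEntities` is a hypothetical helper name, and it deliberately leaves ASCII (including `<`, `>` and `&`) untouched.

```java
class EmojiToEntities {
    // Replaces every code point above 127 with a decimal numeric character
    // reference. Iterating by code point (not by char) keeps surrogate pairs
    // together, so an emoji becomes one reference instead of two.
    static String toDecimalEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String grinning = new String(Character.toChars(0x1F600)); // U+1F600
        System.out.println(toDecimalEntities("<div>" + grinning + "</div>"));
        // prints <div>&#128512;</div>
    }
}
```

For hex codes, `Integer.toHexString(cp)` with an `&#x` prefix would be the analogous change.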

Java Transformer outputs &lt; and &gt; instead of <>

I am editing an XML file in Java with a Transformer by adding more nodes. The old XML is unchanged, but the new XML nodes have &lt; and &gt; instead of < and >, and they are all on the same line. How do I get < and > instead of &lt; and &gt;, and how do I get line breaks after the new nodes? I have already read several similar threads but wasn't able to get the right formatting. Here is the relevant portion of the code:
// Read the XML file
DocumentBuilderFactory dbf= DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc=db.parse(xmlFile.getAbsoluteFile());
Element root = doc.getDocumentElement();
// create a new node
Element newNode = doc.createElement("Item");
// add it to the root node
root.appendChild(newNode);
// create a new attribute
Attr attribute = doc.createAttribute("Name");
// assign the attribute a value
attribute.setValue("Test...");
// add the attribute to the new node
newNode.setAttributeNode(attribute);
// transform the XML
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
StreamResult result = new StreamResult(new FileWriter(xmlFile.getAbsoluteFile()));
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);
Thanks
To replace &gt; and the other escaped entities you can use StringEscapeUtils from org.apache.commons.lang3:
StringEscapeUtils.unescapeXml(resp.toString());
After that you can use the following property of transformer for having line breaks in your xml:
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
based on a question posted here:
public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
fDoc.setXmlStandalone(true);
DOMSource docSource = new DOMSource(fDoc);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "no");
transformer.transform(docSource, new StreamResult(out));
}
produces:
<?xml version="1.0" encoding="UTF-8"?>
The differences I see:
fDoc.setXmlStandalone(true);
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
Try passing an OutputStream instead of a Writer to StreamResult.
StreamResult result = new StreamResult(new FileOutputStream(xmlFile.getAbsoluteFile()));
The Transformer documentation also suggests that.
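The snippet in the question does not show where the escaped markup comes from. One common cause, and it is only an assumption that it applies here, is adding markup as a string (a text node) instead of building it as DOM elements. The sketch below contrasts the two; the class name and the `Items`/`Item` element names are illustrative.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TextVersusElement {
    // Builds <Items> with one child and serializes it without the XML declaration.
    // asText == true adds the child as a string, false adds it as a real element.
    static String build(boolean asText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("Items");
            doc.appendChild(root);
            if (asText) {
                // Markup inside a text node is character data, so it gets escaped.
                root.appendChild(doc.createTextNode("<Item Name=\"Test\"/>"));
            } else {
                Element item = doc.createElement("Item");
                item.setAttribute("Name", "Test");
                root.appendChild(item);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build(true));  // child appears with &lt; escapes
        System.out.println(build(false)); // child appears as a real <Item .../> tag
    }
}
```

If that is the cause, fixing how the nodes are created is safer than unescaping the serialized output afterwards, since unescaping would also touch legitimately escaped text.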

Create namespace in dom element issue

I'm trying to create an XML document with the DOM APIs. When I use the following code I get the
expected result:
Element rootTreeNode = document.createElementNS("http://schemas.microsoft.com/ado/2007","ex" + ":Ex");
This is the output in the console:
<ex:Ex Version="1.0" xmlns:ex="http://schemas.microsoft.com/ado/2007"/>
Now I want to add the following to this element:
xmlns:gp="http://www.pst.com/Protocols/Data/Generic"
but I cannot get xmlns:gp to work. I have tried
the following:
rootTreeNode.setAttributeNS("xmlns" ,"gp","http://www.pst.com/Protocols/Data/Generic");
and I got this:
xmlns:ns0="xmlns" ns0:gp="http://www.pst.com/Protocols/Data/Generic"
If I pass null as the first parameter:
rootTreeNode.setAttributeNS(null ,"gp","http://www.pst.com/Protocols/Data/Generic");
I get just gp with the URL, without the xmlns prefix.
What am I doing wrong here?
Thanks!
Complete test:
DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element root = doc.createElementNS("http://schemas.microsoft.com/ado/2007","ex" + ":Ex");
root.setAttributeNS("http://www.w3.org/2000/xmlns/" ,"xmlns:gp","http://www.pst.com/Protocols/Data/Generic");
doc.appendChild(root);
TransformerFactory transfac = TransformerFactory.newInstance();
Transformer trans = transfac.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
trans.setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter sw = new StringWriter();
StreamResult result = new StreamResult(sw);
DOMSource source = new DOMSource(doc);
trans.transform(source, result);
String xmlString = sw.toString();
System.out.println("Xml:\n\n" + xmlString);
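The point of the test above is that a namespace declaration is itself an attribute in the reserved namespace http://www.w3.org/2000/xmlns/, and its qualified name must include the xmlns: prefix. A small sketch that checks this on the DOM directly, using the JDK constant `XMLConstants.XMLNS_ATTRIBUTE_NS_URI` for that URI (the class and method names are illustrative):

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NamespaceDeclaration {
    static final String GP_URI = "http://www.pst.com/Protocols/Data/Generic";

    // Builds the root element from the question and declares the gp prefix on it.
    static Element buildRoot() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElementNS(
                    "http://schemas.microsoft.com/ado/2007", "ex:Ex");
            // First argument: the reserved xmlns namespace URI.
            // Second argument: the qualified name, "xmlns:" plus the prefix.
            root.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI, "xmlns:gp", GP_URI);
            doc.appendChild(root);
            return root;
        } catch (Exception e) {
            throw new IllegalStateException("could not build document", e);
        }
    }

    public static void main(String[] args) {
        Element root = buildRoot();
        // The declaration is stored under local name "gp" in the xmlns namespace.
        System.out.println(root.getAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI, "gp"));
        // Prefix resolution as seen by the DOM itself.
        System.out.println(root.lookupNamespaceURI("gp"));
    }
}
```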

How to unformat xml file

I have a method which returns a String with a formatted xml. The method reads the xml from a file on the server and parses it into the string:
Essentially, what the method currently does is:
private ServletConfig config;
InputStream xmlIn = null ;
xmlIn = config.getServletContext().getResourceAsStream(filename + ".xml") ;
String xml = IOUtils.toString(xmlIn);
IOUtils.closeQuietly(xmlIn);
return xml;
What I need to do is add a new input argument and, based on its value, either continue returning the formatted XML or return unformatted XML.
What I mean by formatted XML is something like:
<xml>
    <root>
        <elements>
            <elem1/>
            <elem2/>
        </elements>
    </root>
</xml>
And what I mean by unformatted XML is something like:
<xml><root><elements><elem1/><elem2/></elements></root></xml>
or:
<xml>
<root>
<elements>
<elem1/>
<elem2/>
</elements>
</root>
</xml>
Is there a simple way to do this?
Strip all newline characters with String xml = IOUtils.toString(xmlIn).replace("\n", ""). Or strip \t instead to keep the line breaks but drop tab indentation.
If you are sure that the formatted XML looks like:
<xml>
    <root>
        <elements>
            <elem1/>
            <elem2/>
        </elements>
    </root>
</xml>
you can replace group 1 of ^(\s*)< with "". This way, the text content of the XML won't be changed.
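In Java, that regex idea could look like the sketch below. It is an approximation, not the answer's exact pattern: it uses `[ \t]+` instead of `\s*` so that line breaks are never consumed, and the `(?m)` flag makes `^` match at the start of every line. `stripLeadingWhitespace` is a hypothetical helper name.

```java
class StripIndentation {
    // Removes spaces and tabs that sit between the start of a line and a '<'.
    // Text content in the middle of a line is left alone.
    static String stripLeadingWhitespace(String xml) {
        return xml.replaceAll("(?m)^[ \\t]+<", "<");
    }

    public static void main(String[] args) {
        String formatted = "<xml>\n  <root>\n    <elem1/>\n  </root>\n</xml>";
        System.out.println(stripLeadingWhitespace(formatted));
    }
}
```

Like any regex over XML, this can also touch CDATA sections or comments that happen to contain an indented `<`, so it only suits input whose shape you control.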
An empty (identity) transformer with a parameter controlling the indent setting, like so:
public static String getStringFromDocument(Document dom, boolean indented) {
String signedContent = null;
try {
StringWriter sw = new StringWriter();
DOMSource domSource = new DOMSource(dom);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
trans.transform(domSource, new StreamResult(sw));
sw.flush();
signedContent = sw.toString();
} catch (TransformerException e) {
e.printStackTrace();
}
return signedContent;
}
Works for me.
The key lies in this line:
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
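One caveat, stated as my understanding rather than something the answer claims: if the Document was parsed from an already indented file, the indentation is stored as whitespace-only text nodes, and INDENT=no only stops the serializer from adding more. A sketch that removes those nodes before serializing (the class and method names are hypothetical, and it assumes no mixed content where whitespace matters):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CompactXml {
    // Recursively removes text nodes that contain only whitespace.
    static void stripWhitespaceNodes(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // read before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespaceNodes(child);
            }
            child = next;
        }
    }

    // Parses the XML, drops the indentation nodes, and serializes on one line.
    static String toCompactString(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            stripWhitespaceNodes(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not compact XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toCompactString(
                "<xml>\n  <root>\n    <elem1/>\n  </root>\n</xml>"));
    }
}
```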
Try something like the following:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
new StreamSource(new StringReader(
"<xsl:stylesheet version=\"1.0\"" +
" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
"<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>" +
" <xsl:strip-space elements=\"*\"/>" +
" <xsl:template match=\"@*|node()\">" +
" <xsl:copy>" +
" <xsl:apply-templates select=\"@*|node()\"/>" +
" </xsl:copy>" +
" </xsl:template>" +
"</xsl:stylesheet>"
))
);
Source source = new StreamSource(new StringReader("xml string here"));
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
Instead of a StreamSource, the source passed to transform can also be a DOMSource if you have an in-memory Document, for example when you want to modify the DOM before saving.
DOMSource source = new DOMSource(document);
To read an XML file into a Document object:
File file = new File("c:\\MyXMLFile.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
doc.getDocumentElement().normalize();
Enjoy :)
If you fancy trying your hand with JAXB then the marshaller has a handy property for setting whether to format (use new lines and indent) the output or not.
JAXBContext jc = JAXBContext.newInstance(packageName);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
m.marshal(element, outputStream);
Quite an overhead to get to that stage though... perhaps a good option if you already have a solid xsd
You can:
1) remove all consecutive whitespace (but not single whitespace characters) and then replace every >(whitespace)< with ><
applicable only if the useful content does not contain multiple consecutive significant whitespace characters
2) read it into a DOM tree and serialize it using a non-pretty serialization
SAXReader reader = new SAXReader();
Reader r = new StringReader(data);
Document document = reader.read(r);
OutputFormat format = OutputFormat.createCompactFormat();
StringWriter sw = new StringWriter();
XMLWriter writer = new XMLWriter(sw, format);
writer.write(document);
String string = sw.toString();
3) use Canonicalization (but you must somehow explain to it that those whitespaces you want to remove are insignificant)
In Kotlin:
Indentation usually comes right after a newline and consists of one or more spaces. Hence, to join everything onto one line, we replace every newline that is followed by one or more spaces with a single space:
xmlTag = xmlTag.replace("(\n +)".toRegex(), " ")

How to write contents of Document Object to String in NekoHTML?

I am using NekoHTML to parse the contents of an HTML file.
Everything works except extracting the contents of the Document object into a string.
I've tried using:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
But nothing appears to be returned.
The problem was in Oracle App Server 10.3.1.4: http://m-hewedy.blogspot.com/2011/04/oracle-application-server-overrides.html
Possible solution:
//this nekohtml
DOMParser parser = new DOMParser();
parser.parse(archivo);
//this xerces
OutputFormat format = new OutputFormat(parser.getDocument());
format.setIndenting(true);
//print xml for console
//XMLSerializer serializer = new XMLSerializer(System.out, format);
//save xml in string var
OutputStream outputStream = new ByteArrayOutputStream();
XMLSerializer serializer = new XMLSerializer(outputStream, format);
//process
serializer.serialize(parser.getDocument());
String xmlText = outputStream.toString();
System.out.println(xmlText);
//to generate a file output use fileoutputstream instead of system.out
//XMLSerializer serializer = new XMLSerializer(new FileOutputStream(new File("book.xml")), format);
Url: http://totheriver.com/learn/xml/xmltutorial.html#6.2
See e) Serialize DOM to FileOutputStream to generate the xml file "book.xml" .
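For comparison, on a plain JDK the Transformer route from the question works when the result is a StringWriter that is read back after the transform. The question does not show how `writer` was created, so this is only a sketch under that assumption. It uses a hand-built Document instead of NekoHTML output, and the class and method names are illustrative. The output method is pinned to "xml" because the default can switch to HTML serialization when the root element is named html.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DocumentToString {
    // Serializes any DOM Document into a String through a StringWriter.
    static String toXmlString(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString(); // read the writer only after transform()
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    // Stand-in for a parsed page: <html><body><p>hello</p></body></html>
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("hello");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXmlString(sampleDocument()));
    }
}
```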
