How to unformat xml file - java

I have a method which returns a String with a formatted xml. The method reads the xml from a file on the server and parses it into the string:
Essentially, what the method currently does is:
private ServletConfig config;
InputStream xmlIn = null ;
xmlIn = config.getServletContext().getResourceAsStream(filename + ".xml") ;
String xml = IOUtils.toString(xmlIn);
IOUtils.closeQuietly(xmlIn);
return xml;
What I need to do is add a new input argument, and based on that value, continue returning the formatted xml, or return unformatted xml.
What I mean by formatted xml is something like:
<xml>
    <root>
        <elements>
            <elem1/>
            <elem2/>
        </elements>
    </root>
</xml>
And what I mean by unformatted xml is something like:
<xml><root><elements><elem1/><elem2/></elements></root></xml>
or:
<xml>
<root>
<elements>
<elem1/>
<elem2/>
</elements>
</root>
</xml>
Is there a simple way to do this?

Strip all newline characters with String xml = IOUtils.toString(xmlIn).replace("\n", ""). To keep the line breaks but drop the indentation, strip \t instead (this assumes the file is indented with tabs).

If you are sure that the formatted xml looks like:
<xml>
    <root>
        <elements>
            <elem1/>
            <elem2/>
        </elements>
    </root>
</xml>
you can replace every match of group 1 in ^(\s*)< with "" (in multiline mode, so that ^ matches at the start of each line). That way, the text content of the xml won't be changed.
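A minimal sketch of that replacement in Java, assuming the indentation consists only of spaces and tabs at the start of a line. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class IndentStripper {
    // (?m) makes ^ match at the start of every line, not only at the start of the input.
    private static final Pattern LEADING_INDENT = Pattern.compile("(?m)^[ \\t]+<");

    /** Removes spaces and tabs that precede a tag at the start of a line; line breaks are kept. */
    public static String stripIndent(String xml) {
        return LEADING_INDENT.matcher(xml).replaceAll("<");
    }

    public static void main(String[] args) {
        String formatted = "<xml>\n  <root>\n    <elem1/>\n  </root>\n</xml>";
        System.out.println(stripIndent(formatted));
    }
}
```

Only whitespace directly in front of a tag at the start of a line is touched, so text inside elements stays as it was.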

An empty (identity) transformer with a parameter that sets the indent output property, like so:
public static String getStringFromDocument(Document dom, boolean indented) {
    String signedContent = null;
    try {
        StringWriter sw = new StringWriter();
        DOMSource domSource = new DOMSource(dom);
        // Use the factory lookup instead of instantiating a vendor-specific implementation class
        TransformerFactory tf = TransformerFactory.newInstance();
        // No stylesheet: the transformer copies the source to the result unchanged
        Transformer trans = tf.newTransformer();
        trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
        trans.transform(domSource, new StreamResult(sw));
        signedContent = sw.toString();
    } catch (TransformerException e) {
        e.printStackTrace();
    }
    return signedContent;
}
works for me.
The key lies in this line:
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
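One caveat: setting INDENT to "no" only stops the serializer from adding indentation. Whitespace text nodes that were parsed from an already formatted file are written back unchanged. The following is a sketch, not a definitive implementation, that removes whitespace-only text nodes before serializing. The class name XmlFormatToggle and the method render are hypothetical, and the sketch assumes whitespace-only text is never significant in the document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlFormatToggle {
    /** Re-serializes the XML string, indented or compact depending on the flag. */
    public static String render(String xml, boolean indented) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Drop whitespace-only text nodes so the old indentation does not survive.
            NodeList blanks = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//text()[normalize-space(.) = '']", doc, XPathConstants.NODESET);
            for (int i = 0; i < blanks.getLength(); i++) {
                Node blank = blanks.item(i);
                blank.getParentNode().removeChild(blank);
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Sketch-level error handling: wrap parser and transformer failures.
            throw new IllegalStateException("Could not re-serialize XML", e);
        }
    }
}
```

The exact layout of the indented output depends on the transformer implementation in use, so only the compact form is predictable across JDK versions.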

Try something like the following:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
    new StreamSource(new StringReader(
        "<xsl:stylesheet version=\"1.0\"" +
        " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
        "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>" +
        " <xsl:strip-space elements=\"*\"/>" +
        " <xsl:template match=\"@*|node()\">" +
        " <xsl:copy>" +
        " <xsl:apply-templates select=\"@*|node()\"/>" +
        " </xsl:copy>" +
        " </xsl:template>" +
        "</xsl:stylesheet>"
    ))
);
Source source = new StreamSource(new StringReader("xml string here"));
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
The source passed to transform does not have to be a StreamSource. If you have an in-memory Document, for example because you want to modify the DOM before saving, it can also be a DOMSource:
DOMSource source = new DOMSource(document);
To read an XML file into a Document object:
File file = new File("c:\\MyXMLFile.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
doc.getDocumentElement().normalize();
Enjoy :)

If you fancy trying your hand with JAXB then the marshaller has a handy property for setting whether to format (use new lines and indent) the output or not.
JAXBContext jc = JAXBContext.newInstance(packageName);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
m.marshal(element, outputStream);
Quite an overhead to get to that stage though... perhaps a good option if you already have a solid xsd

You can:
1) remove all consecutive whitespace (but not single whitespace characters) and then replace every >(whitespace)< with ><
applicable only if the useful content does not contain multiple consecutive significant whitespace characters
2) read it into some DOM tree and serialize it with a non-pretty serialization (dom4j in this example)
SAXReader reader = new SAXReader();
Reader r = new StringReader(data);
Document document = reader.read(r);
OutputFormat format = OutputFormat.createCompactFormat();
StringWriter sw = new StringWriter();
XMLWriter writer = new XMLWriter(sw, format);
writer.write(document);
String string = sw.toString(); // the serialized text is in the StringWriter, not in the XMLWriter
3) use Canonicalization (but you must somehow explain to it that those whitespaces you want to remove are insignificant)
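The second half of option 1 can be sketched with a single regular expression. This is an illustration under the assumption that whitespace between a closing > and the next < is never significant; the class and method names are hypothetical.

```java
public class InterTagWhitespace {
    /** Removes whitespace runs that sit directly between a '>' and the next '<'. */
    public static String collapse(String xml) {
        return xml.replaceAll(">\\s+<", "><");
    }

    public static void main(String[] args) {
        System.out.println(collapse("<a>\n  <b>x  y</b>\n</a>"));
    }
}
```

Whitespace inside text content is left alone, but the pattern does not know about comments or CDATA sections, so it will also rewrite matches inside them.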

Kotlin.
Indentation usually comes right after a newline and consists of one or more spaces. Hence, to put everything on one line, replace each newline that is followed by one or more spaces with a single space:
xmlTag = xmlTag.replace("(\n +)".toRegex(), " ")

Related

Java Transformer converts Chinese character to ASCII value [duplicate]

This question already has answers here:
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8") is NOT working
OK, after a lot of searching I decided to ask here. Below is sample code that reproduces my problem. The Document object is built with a Chinese character.
String value= "𧀠";
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("value");
root.setAttribute("attribute", value);
doc.appendChild(root);
DOMSource source = new DOMSource(doc);
I am trying to convert the document source to string using the Transformer class with the below code.
ByteArrayOutputStream outStream = null;
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StreamResult htmlStreamResult = new StreamResult( new ByteArrayOutputStream() );
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(source, htmlStreamResult);
outStream = (ByteArrayOutputStream) htmlStreamResult.getOutputStream();
String outPut = outStream.toString( "UTF-8" );
But I got output in which the Chinese character is converted to a numeric character reference, as below.
<?xml version="1.0" encoding="UTF-8" standalone="no"?><value attribute="&#159776;"/>
I do not want the Chinese character to be converted; it should be written as it is. I would appreciate any help with this.
Change UTF-8 to UTF-16. Since you're building a String (which is encoding-agnostic), this has no ill effect on the result. It does, however, put encoding="UTF-16" into the XML declaration and sometimes adds a BOM (byte order mark). You can optionally leave the declaration out and attach your own.
String value= "𧀠かな〜"; // (I don't see your character so I added some of my own)
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("value");
root.setAttribute("attribute", value);
doc.appendChild(root);
DOMSource source = new DOMSource(doc);
ByteArrayOutputStream outStream = null;
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StreamResult htmlStreamResult = new StreamResult( new ByteArrayOutputStream() );
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
// transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes"); // optional
transformer.transform(source, htmlStreamResult);
outStream = (ByteArrayOutputStream) htmlStreamResult.getOutputStream();
String outPut = outStream.toString( "UTF-16" );
System.out.println(outPut);
Output:
<?xml version="1.0" encoding="UTF-16" standalone="no"?><value attribute="𧀠かな〜"/>
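For what it's worth, a numeric character reference and the literal character are the same data to an XML parser, so the converted output still describes the same attribute value. The sketch below checks this with U+27020, which is assumed here to be the character from the question; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharRefCheck {
    /** Parses the XML and returns the value of the "attribute" attribute on the root element. */
    public static String attributeOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("attribute");
        } catch (Exception e) {
            // Sketch-level error handling: wrap parser failures.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Build the character from its code point so this source file stays ASCII-only.
        String literal = new String(Character.toChars(0x27020));
        String fromReference = attributeOf("<value attribute=\"&#x27020;\"/>");
        System.out.println(literal.equals(fromReference));
    }
}
```

Whether the literal form is required therefore depends on who reads the file: a parser does not care, a human or a plain-text comparison does.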

How to unescape string in XML using Transformer?

I have a function which takes an XML document as a parameter and writes it to a file. The document contains an element such as <tag>"some text & some text": <text> text</tag>, but in the output file it is written as <tag>"some text &amp; some text": &lt;text&gt; text</tag>. I don't want the string to be escaped when it is written to the file.
The function is:
public static void function(Document doc, String fileUri, String randomId){
DOMSource source = new DOMSource(doc,ApplicationConstants.ENC_UTF_8);
FileWriterWithEncoding writer = null;
try {
File file = new File(fileUri+File.separator+randomId+".xml");
if (!new File(fileUri).exists()){
new File(fileUri).mkdirs();
}
writer = new FileWriterWithEncoding(new File(file.toString()),ApplicationConstants.ENC_UTF_8);
StreamResult result = new StreamResult(writer);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = null;
transformer = transformerFactory.newTransformer();
transformer.setParameter(OutputKeys.INDENT, "yes");
transformer.transform(source, result);
writer.close();
transformer.clearParameters();
}catch (IOException | TransformerException e) {
log.error("convert Exception is :"+ e);
}
}
There are five escape characters in XML ("'<>&). According to XML grammar, they must be escaped in certain places in XML, please see this question:
What characters do I need to escape in XML documents?
So you can't do much to avoid escaping, for instance, & or < in text content.
You could use CDATA sections if you want to retain "unescaped" content. Please see this question:
Add CDATA to an xml file
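A minimal sketch of the CDATA route with the standard DOM and Transformer APIs. The class and method names are hypothetical, and the element name tag is taken from the question.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CdataExample {
    /** Builds <tag> with the raw text in a CDATA section and returns the serialized XML. */
    public static String write(String rawText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element tag = doc.createElement("tag");
            // A CDATA section node is serialized verbatim instead of being escaped.
            tag.appendChild(doc.createCDATASection(rawText));
            doc.appendChild(tag);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Sketch-level error handling: wrap parser and transformer failures.
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write("\"some text & some text\": <text> text"));
    }
}
```

The file then contains the CDATA markers around the text, and any XML parser reading it back sees the same character data as with the escaped form.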

Xml parsing for in between xml data with tags in java

I have an XML string in one of my Java String objects, as below:
<Record>
<op>Add</op>
<sensdata>400188711111</sensdata>
<id>4</id>
<a1>1111201090467034</a1>
</Record>
If I need the data between
<Record> </Record>
i.e.
<op>Add</op>
<sensdata>4001887XXXXX</sensdata>
<id>4</id>
<a1>1111201090467034</a1>
Can I get this using the XML parser? I am able to get the values, like Add40018871111141111201090467034, but not with the tags.
Below is my code snippet
ByteArrayInputStream stream = new ByteArrayInputStream("<Record><op>Add</op><sensdata>400188711111</sensdata><id>4</id><a1>1111201090467034</a1></Record>".getBytes());
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(stream);
NodeList sensdata = document.getDocumentElement().getElementsByTagName("sensdata");
String sensitiveData = sensdata.item(0).getTextContent();
Editing my question with the Solution i have tried:
I did as below:
ByteArrayInputStream stream = new ByteArrayInputStream(toBeParsed.getBytes());
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(stream);
NodeList XmlTagNodeList = document.getDocumentElement().getElementsByTagName(XmlTag);
Document newXmlDocument = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().newDocument();
for (int i = 0; i < XmlTagNodeList.getLength(); i++) {
Node node = XmlTagNodeList.item(i);
Node copyNode = newXmlDocument.importNode(node, true);
newXmlDocument.appendChild(copyNode);
}
DOMImplementationLS domImplementationLS = (DOMImplementationLS) newXmlDocument.getImplementation();
LSSerializer lsSerializer = domImplementationLS.createLSSerializer();
record = lsSerializer.writeToString(newXmlDocument);
System.out.println(record);
It prints each record prepended with the XML declaration.
Is this a good way to do it? I also do not need the XML declaration. How do I get rid of it?
If you are trying to output the XML literally, with the tags and stuff, you can do it by using the Transformer API available at javax.xml.transform.*
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no"); // I prefer to keep the <?xml ?> declaration; use "yes" to omit it
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.INDENT, "yes"); // Indent the code
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); // Original encoding
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4"); // Indent 4 spaces
// write the XML of a single node to an OutputStream
transformer.transform(new DOMSource(node), new StreamResult(new OutputStreamWriter(out, "UTF-8")));
Is this what you need?
Edit (looping through the nodes):
// Your previous code
LSSerializer lsSerializer = domImplementationLS.createLSSerializer();
StringBuilder builder = new StringBuilder();
NodeList children = newXmlDocument.getChildNodes();
for(int x = 0; x < children.getLength(); x++)
{
Node node = children.item(x);
builder.append(lsSerializer.writeToString(node));
builder.append("\n");
}
System.out.println(builder.toString());
Is that any help?
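To drop the XML declaration from LSSerializer output, the DOM Load and Save configuration has an "xml-declaration" parameter that can be switched off. The sketch below combines that with the loop above to return the inner XML of the root element; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

public class InnerXml {
    /** Serializes every child of the root element, without an XML declaration. */
    public static String innerXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
            LSSerializer serializer = ls.createLSSerializer();
            // Switch off the <?xml ...?> line that is otherwise written for each node.
            serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);

            StringBuilder sb = new StringBuilder();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                sb.append(serializer.writeToString(children.item(i)));
            }
            return sb.toString();
        } catch (Exception e) {
            // Sketch-level error handling: wrap parser failures.
            throw new IllegalStateException("Could not extract inner XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(innerXml("<Record><op>Add</op><id>4</id></Record>"));
    }
}
```

Serializing the children of the parsed root directly also avoids copying nodes into a second document.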

Bad Characters when parsing GML in Java

I'm using the org.w3c.dom package to parse the gml schemas (http://schemas.opengis.net/gml/3.1.0/base/).
When I parse the gmlBase.xsd schema and then save it back out, the quote characters around GeometryCollections in the BagType complex type come out converted to bad characters (See code below).
Is there something wrong with how I'm parsing or saving the xml, or is there something in the schema that is off?
Thanks,
Curtis
public static void main(String[] args) throws IOException
{
File schemaFile = File.createTempFile("gml_", ".xsd");
FileUtils.writeStringToFile(schemaFile, getSchema(new URL("http://schemas.opengis.net/gml/3.1.0/base/gmlBase.xsd")));
System.out.println("wrote file: " + schemaFile.getAbsolutePath());
}
public static String getSchema(URL schemaURL)
{
try
{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(IOUtils.toString(schemaURL.openStream()))));
Element rootElem = doc.getDocumentElement();
rootElem.normalize();
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
DOMSource source = new DOMSource(doc);
ByteArrayOutputStream xmlOutStream = new ByteArrayOutputStream();
StreamResult result = new StreamResult(xmlOutStream);
transformer.transform(source, result);
return xmlOutStream.toString();
}
catch (Exception e)
{
e.printStackTrace();
}
return "";
}
I'm suspicious of this line:
Document doc = db.parse(new InputSource(
new StringReader(IOUtils.toString(schemaURL.openStream()))));
I don't know what IOUtils.toString does here but presumably it's assuming a particular encoding, without taking account of the XML declaration.
Why not just use:
Document doc = db.parse(schemaURL.openStream());
Likewise, your FileUtils.writeStringToFile call doesn't appear to specify a character encoding. Which encoding does it use, and which encoding does the StreamResult use?
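A sketch of a round trip that never passes through a String with an implicit encoding: the parser reads the raw bytes and honors the XML declaration, and the output encoding is set explicitly. The class and method names are hypothetical, and UTF-8 is an assumed choice for the output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class SchemaRoundTrip {
    /** Parses XML from raw bytes and serializes it again as UTF-8 bytes. */
    public static byte[] roundTrip(InputStream in) {
        try {
            // Given a byte stream, the parser detects the encoding from the XML declaration.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            // Sketch-level error handling: wrap parser and transformer failures.
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc>\u201CGeometryCollections\u201D</doc>";
        byte[] bytes = roundTrip(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Writing the returned bytes straight to the file (for example with java.nio.file.Files.write) keeps the declared encoding and the actual bytes in agreement.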

Java:XML Parser

I have a response XML something like this -
<Response> <aa> <Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> </aa> </Response>
I want to extract the whole content from <Fromhere> to </Fromhere> in a string. Is it possible to do that through any string function or through XML parser?
Please advise.
You could try an XPath approach for simplicity in XML parsing:
// Parenthesize the concatenation so that getBytes() applies to the whole string
InputStream response = new ByteArrayInputStream(("<Response> <aa> "
        + "<Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> "
        + "</aa> </Response>").getBytes()); /* Or whatever. */
DocumentBuilder builder = DocumentBuilderFactory
        .newInstance().newDocumentBuilder();
Document doc = builder.parse(response);
XPath xpath = XPathFactory.newInstance().newXPath();
// Element names are case-sensitive: the element is Fromhere, not FromHere
XPathExpression expr = xpath.compile("string(/Response/aa/Fromhere)");
String result = (String)expr.evaluate(doc, XPathConstants.STRING);
Note that I haven't tried this code. It may need tweaking.
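Note that the XPath string() function returns only the text content of the element, without the tags. To keep the markup, one option is to select the node itself and serialize it. The following is a sketch along those lines, with hypothetical class and method names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FragmentExtractor {
    /** Returns the first node matched by the XPath expression as an XML string, or null. */
    public static String extract(String xml, String xpathExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpr, doc, XPathConstants.NODE);
            if (node == null) {
                return null; // nothing matched
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            // Serializing the selected node keeps its tags and its children.
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Sketch-level error handling: wrap parser, XPath and transformer failures.
            throw new IllegalStateException("Could not extract fragment", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Response> <aa> <Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> </aa> </Response>";
        System.out.println(extract(xml, "/Response/aa/Fromhere"));
    }
}
```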
Through an XML parser. Using string functions to parse XML is a bad idea...
Besides the Sun tutorials pointed out above, you can check the DZone Refcardz on Java and XML; I found it a good, terse explanation of how to do it.
But well, there is probably plenty of Web resources on the topic, including on this very site.
You can apply an XSLT stylesheet to extract the desired content.
This stylesheet should fit your example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/Response/aa/Fromhere/*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Apply it with something like the following (exception handling not included):
String xml = "<Response> <aa> <Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> </aa> </Response>";
Source xsl = new StreamSource(new FileReader("/path/to/file.xsl"));
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer(xsl);
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StringWriter out = new StringWriter();
transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
System.out.println(out.toString());
This should work with any version of Java starting with 1.4.
This should work
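import java.util.regex.*;
// DOTALL lets '.' match line breaks inside the element
Pattern p = Pattern.compile("<Fromhere>.*</Fromhere>", Pattern.DOTALL);
Matcher m = p.matcher(responseString);
// group() is only valid after a successful find()
String whatYouWant = m.find() ? m.group() : null;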
It would be a little more verbose to use Scanner, but that could work too.
Whether this is a good idea is for someone more experienced than I.
One option is to use a StreamFilter:
class MyFilter implements StreamFilter {
private boolean on;
@Override
public boolean accept(XMLStreamReader reader) {
final String element = "Fromhere";
if (reader.isStartElement() && element.equals(reader.getLocalName())) {
on = true;
} else if (reader.isEndElement()
&& element.equals(reader.getLocalName())) {
on = false;
return true;
}
return on;
}
}
Combined with a Transformer, you can use this to safely parse logically-equivalent markup like this:
<Response>
<!-- <Fromhere></Fromhere> -->
<aa>
<Fromhere>
<a1>Content</a1> <a2>Content</a2>
</Fromhere>
</aa>
</Response>
Demo:
StringWriter writer = new StringWriter();
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLStreamReader reader = inputFactory
.createXMLStreamReader(new StringReader(xmlString));
reader = inputFactory.createFilteredReader(reader, new MyFilter());
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new StAXSource(reader), new StreamResult(writer));
System.out.println(writer.toString());
This is a programmatic variation on Massimiliano Fliri's approach.
