Java:XML Parser

I have a response XML something like this -
<Response> <aa> <Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> </aa> </Response>
I want to extract the whole content from <Fromhere> to </Fromhere> in a string. Is it possible to do that through any string function or through XML parser?
Please advise.

You could try an XPath approach for simplicity in XML parsing:
InputStream response = new ByteArrayInputStream(("<Response> <aa> "
+ "<Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> "
+ "</aa> </Response>").getBytes()); /* Or whatever. */
DocumentBuilder builder = DocumentBuilderFactory
.newInstance().newDocumentBuilder();
Document doc = builder.parse(response);
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("string(/Response/aa/Fromhere)");
String result = (String)expr.evaluate(doc, XPathConstants.STRING);
Note that I haven't tried this code. It may need tweaking.
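One caveat with the snippet above: the XPath string() function returns only the concatenated text of the selected element, not its markup. If the markup itself is wanted, one option is to select the node and serialize it with an identity Transformer. The following is a minimal sketch of that idea; the class name FragmentExtractor and the method name extract are made up for illustration and are not part of the original answer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: selects one node by XPath and returns it as an XML string.
public class FragmentExtractor {

    public static String extract(String xml, String xpathExpr) throws Exception {
        // Parse the whole response into a DOM tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Ask XPath for the node itself rather than its string value.
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpathExpr, doc, XPathConstants.NODE);
        if (node == null) {
            return null; // nothing matched the expression
        }

        // An identity transformer serializes the selected subtree.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Response><aa><Fromhere><a1>Content</a1>"
                + "<a2>Content</a2></Fromhere></aa></Response>";
        System.out.println(extract(xml, "/Response/aa/Fromhere"));
    }
}
```

The result includes the Fromhere element itself; to get only its children, each child node would have to be serialized in turn.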

Through an XML parser. Using string functions to parse XML is a bad idea...
Besides the Sun tutorials pointed out above, you can check the DZone Refcardz on Java and XML; I found it a good, terse explanation of how to do it.
But there are probably plenty of Web resources on the topic, including on this very site.

You can apply an XSLT stylesheet to extract the desired content.
This stylesheet should fit your example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/Response/aa/Fromhere/*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Apply it with something like the following (exception handling not included):
String xml = "<Response> <aa> <Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> </aa> </Response>";
Source xsl = new StreamSource(new FileReader("/path/to/file.xsl"));
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer(xsl);
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StringWriter out = new StringWriter();
transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
System.out.println(out.toString());
This should work with any version of Java starting with 1.4.

This should work:
import java.util.regex.*;
Pattern p = Pattern.compile("<Fromhere>.*</Fromhere>");
Matcher m = p.matcher(responseString);
// group() is only valid after a successful find()
String whatYouWant = m.find() ? m.group() : null;
It would be a little more verbose to use Scanner, but that could work too.
Whether this is a good idea is for someone more experienced than I.

One option is to use a StreamFilter:
class MyFilter implements StreamFilter {
private boolean on;
@Override
public boolean accept(XMLStreamReader reader) {
final String element = "Fromhere";
if (reader.isStartElement() && element.equals(reader.getLocalName())) {
on = true;
} else if (reader.isEndElement()
&& element.equals(reader.getLocalName())) {
on = false;
return true;
}
return on;
}
}
Combined with a Transformer, you can use this to safely parse logically-equivalent markup like this:
<Response>
<!-- <Fromhere></Fromhere> -->
<aa>
<Fromhere>
<a1>Content</a1> <a2>Content</a2>
</Fromhere>
</aa>
</Response>
Demo:
StringWriter writer = new StringWriter();
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLStreamReader reader = inputFactory
.createXMLStreamReader(new StringReader(xmlString));
reader = inputFactory.createFilteredReader(reader, new MyFilter());
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new StAXSource(reader), new StreamResult(writer));
System.out.println(writer.toString());
This is a programmatic variation on Massimiliano Fliri's approach.

Related

How to prevent self-closing <tags/> in XML?

I modify an XML file using the Transformer class and its transform method. It modifies my parameters correctly, but it changes the XML style (empty elements are written as self-closing tags):
Original:
<a struct="b"></a>
<c></c>
After edit:
<a struct="b"/>
<c/>
I know that I can set properties with transformer.setOutputProperty(OutputKeys.KEY, value), but I could not find a suitable setting.
How can I keep the transformer from changing the output format?
XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader());
Source src = new SAXSource(xr, new InputSource(new
StringReader(xmlArray[i])));
<<modify xml>>
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,"yes");
StringWriter buffer = new StringWriter();
transformer.transform(src, new StreamResult(buffer));
xmlArray[i] = buffer.toString();

Those forms are semantically equivalent. No conforming XML parser will care, and neither should you.
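A quick way to see this is to parse both forms and compare the resulting trees with Node.isEqualNode. The sketch below does that with the JDK's DOM parser; the class name TagFormEquivalence and its methods are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo: compares the DOM trees produced by two XML strings.
public class TagFormEquivalence {

    public static Element parseRoot(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // True when both strings parse to structurally equal element trees.
    public static boolean sameTree(String a, String b) throws Exception {
        return parseRoot(a).isEqualNode(parseRoot(b));
    }

    public static void main(String[] args) throws Exception {
        // Start/end tag pair versus self-closing tag.
        System.out.println(sameTree("<a struct=\"b\"></a>", "<a struct=\"b\"/>"));
        // An element containing a space is NOT the same as an empty element.
        System.out.println(sameTree("<a> </a>", "<a/>"));
    }
}
```

The second comparison is there as a reminder that only truly empty elements are interchangeable with the self-closing form; an element holding whitespace has a text child.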

Extract XML element as string including attribute namespace using StAX

Given the following XML string
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="http://a" xmlns:b="http://b">
<a:element b:attribute="value">
<subelement/>
</a:element>
</root>
I'd like to extract the element a:element as an XML string while preserving the used namespaces using StAX. So I would expect
<?xml version="1.0" encoding="UTF-8"?>
<a:element xmlns:a="http://a" xmlns:b="http://b" b:attribute="value">
<subelement/>
</a:element>
Following answers like https://stackoverflow.com/a/5170415/2391901 and https://stackoverflow.com/a/4353531/2391901, I already have the following code:
final ByteArrayInputStream inputStream = new ByteArrayInputStream(inputString.getBytes(StandardCharsets.UTF_8));
final XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
final XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(inputStream);
xmlStreamReader.nextTag();
xmlStreamReader.nextTag();
final TransformerFactory transformerFactory = TransformerFactory.newInstance();
final Transformer transformer = transformerFactory.newTransformer();
final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
transformer.transform(new StAXSource(xmlStreamReader), new StreamResult(outputStream));
final String outputString = outputStream.toString(StandardCharsets.UTF_8.name());
However, the result does not contain the namespace http://b of the attribute b:attribute (using either the default StAX parser of Java 8 or the StAX parser of Aalto XML):
<?xml version="1.0" encoding="UTF-8"?>
<a:element xmlns:a="http://a" b:attribute="value">
<subelement/>
</a:element>
How do I get the expected result using StAX?

It would be cleaner to use an xslt transform to do this. You're already using an identity transformer to perform output - just set it up to copy the target element instead of everything:
public static void main(String[] args) throws TransformerException {
String inputString =
"<root xmlns:a='http://a' xmlns:b='http://b'>" +
" <a:element b:attribute='value'>" +
" <subelement/>" +
" </a:element>" +
"</root>";
String xslt =
"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:a='http://a'>" +
" <xsl:template match='/root'>" +
" <xsl:copy-of select='a:element'/>" +
" </xsl:template>" +
"</xsl:stylesheet>";
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer(new StreamSource(new StringReader(xslt)));
transformer.transform(new StreamSource(new StringReader(inputString)), new StreamResult(System.out));
}
The stax subtree transform that you're using relies on some iffy behaviour of the transformer that ships with the jdk. It didn't work when I tried it with the Saxon transformer (which complained about the trailing </root>).

How to avoid encoding of <,>,& with Document.createTextNode

class XMLencode
{
public static void main(String[] args)
{
try{
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element root = doc.createElement("roseindia");
doc.appendChild(root);
Text elmnt=doc.createTextNode("<data>sun</data><abcdefg/><end/>");
root.appendChild(elmnt);
TransformerFactory tranFactory = TransformerFactory.newInstance();
Transformer aTransformer = tranFactory.newTransformer();
Source src = new DOMSource(doc);
Result dest = new StreamResult(System.out);
aTransformer.transform(src, dest);
}catch(Exception e){
System.out.println(e.getMessage());
}
}
}
Above is my code.
The output generated looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><roseindia>&lt;data&gt;sun&lt;/data&gt;&lt;abcdefg/&gt;&lt;end/&gt;</roseindia>
I don't want the tags to be encoded. I need the output in this form:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><roseindia><data>sun</data><abcdefg/><end/></roseindia>
Please help me on this.
Thanks,
Mohan

Short Answer
You could leverage the CDATA mechanism in XML to prevent characters from being escaped. Below is an example of the DOM code:
doc.createCDATASection("<foo/>");
The content will be:
<![CDATA[<foo/>]]>
Long Answer
Below is a complete example of leveraging a CDATA section using the DOM APIs.
package forum12525152;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.*;
public class Demo {
public static void main(String[] args) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.newDocument();
Element rootElement = document.createElement("root");
document.appendChild(rootElement);
// Create Element with a Text Node
Element fooElement = document.createElement("foo");
fooElement.setTextContent("<foo/>");
rootElement.appendChild(fooElement);
// Create Element with a CDATA Section
Element barElement = document.createElement("bar");
CDATASection cdata = document.createCDATASection("<bar/>");
barElement.appendChild(cdata);
rootElement.appendChild(barElement);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);
StreamResult result = new StreamResult(System.out);
t.transform(source, result);
}
}
Output
Note the difference in the foo and bar elements even though they have similar content. I have formatted the result of running the demo code to make it more readable:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<foo>&lt;foo/&gt;</foo>
<bar><![CDATA[<bar/>]]></bar>
</root>

Instead of writing doc.createTextNode("<data>sun</data><abcdefg/><end/>"), you should create each element:
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import org.w3c.dom.*;
class XMLencode {
public static void main(String[] args) {
try {
DocumentBuilderFactory factory = DocumentBuilderFactory
.newInstance();
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element root = doc.createElement("roseindia");
doc.appendChild(root);
Element data = doc.createElement("data");
root.appendChild(data);
Text elemnt = doc.createTextNode("sun");
data.appendChild(elemnt);
Element data1 = doc.createElement("abcdefg");
root.appendChild(data1);
//Text elmnt = doc.createTextNode("<data>sun</data><abcdefg/><end/>");
//root.appendChild(elmnt);
TransformerFactory tranFactory = TransformerFactory.newInstance();
Transformer aTransformer = tranFactory.newTransformer();
Source src = new DOMSource(doc);
Result dest = new StreamResult(System.out);
aTransformer.transform(src, dest);
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
}

You can keep using doc.createTextNode and apply a (lengthy) workaround for the escaped characters.
SOAPMessage msg = messageContext.getMessage();
header.setTextContent(seched);
Then use
Source src = msg.getSOAPPart().getContent();
to get the content, then transform it to a string:
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StreamResult result1 = new StreamResult(new StringWriter());
transformer.transform(src, result1);
Replace the escaped special characters in the string:
String xmlString = result1.getWriter().toString()
.replaceAll("&lt;", "<")
.replaceAll("&gt;", ">");
System.out.print(xmlString);
Then go the opposite way, from string to DOM, with the escaped characters fixed:
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(xmlString));
Document doc = db.parse(is);
Source src123 = new DOMSource(doc);
Then set it back to the soap message
msg.getSOAPPart().setContent(src123);

Don't use createTextNode - the whole point of it is to insert some text (as data) into the document, not a fragment of raw XML.
Use a combination of createTextNode for the text and createElement for the elements.

I don't want the tags to be encoded. I need the output in this fashion.
Then you don't want a text node at all - which is why createTextNode isn't working for you. (Or rather, it's working fine - it's just not doing what you want). You should probably just parse your XML string, then import the document node from the result into your new document.
Of course, if you know the elements beforehand, don't express them as text in the first place - use a mixture of createElement, createAttribute, createTextNode and appendChild to create the structure.
It's entirely possible that something like JDOM will make this simpler, but that's the basic approach.
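The parse-then-import approach could look roughly like the sketch below. It assumes the fragment from the question, which has several top-level elements and therefore is not a well-formed document on its own, so it is wrapped in a temporary root before parsing. The class name FragmentImport, the method name wrap, and the temporary element name wrapper are all made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: inserts a raw XML fragment as real nodes, not as text.
public class FragmentImport {

    public static String wrap(String rootName, String fragment) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Target document with the desired root element.
        Document doc = builder.newDocument();
        Element root = doc.createElement(rootName);
        doc.appendChild(root);

        // Wrap the fragment in a temporary root so it parses as one document.
        Document parsed = builder.parse(new InputSource(
                new StringReader("<wrapper>" + fragment + "</wrapper>")));

        // importNode makes a deep copy owned by the target document.
        for (Node child = parsed.getDocumentElement().getFirstChild();
                child != null; child = child.getNextSibling()) {
            root.appendChild(doc.importNode(child, true));
        }

        // Serialize the result without the XML declaration.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(wrap("roseindia", "<data>sun</data><abcdefg/><end/>"));
    }
}
```

Because the fragment is parsed, it has to be well-formed; malformed input raises a SAXException instead of being silently escaped.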

Mohan,
You can't use Document.createTextNode(). That method transforms (or escapes) the characters in your XML.
Instead, you need to build two separate Documents from the two XML sources and use importNode.
I use Document.importNode() like this to solve my problem:
Build your builders:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbf.newDocumentBuilder();
Document oldDoc = builder.parse(isOrigXml); //this is XML as InputSource
Document newDoc = builder.parse(isInsertXml); //this is XML as InputSource
Next, build a NodeList of the Element/Node you want to import. Create a Node from the NodeList. Create another Node of what you are going to import using importNode. Build the last Node of the final XML as such:
NodeList nl = newDoc.getElementsByTagName("roseindia"); //or whatever the element name is
Node xmlToInsert = nl.item(0);
Node importNode = oldDoc.importNode(xmlToInsert, true);
Node target = ((NodeList) oldDoc.getElementsByTagName("ELEMENT_NAME_OF_LOCATION")).item(0);
target.appendChild(importNode);
Source source = new DOMSource(target);
....
The rest is standard Transformer - StringWriter to StreamResult stuff to get the results.

How to unformat xml file

I have a method which returns a String with a formatted xml. The method reads the xml from a file on the server and parses it into the string:
Essentially, what the method currently does is:
private ServletConfig config;
InputStream xmlIn = null ;
xmlIn = config.getServletContext().getResourceAsStream(filename + ".xml") ;
String xml = IOUtils.toString(xmlIn);
IOUtils.closeQuietly(xmlIn);
return xml;
What I need to do is add a new input argument, and based on that value, continue returning the formatted xml, or return unformatted xml.
What I mean with formatted xml is something like:
<xml>
  <root>
    <elements>
      <elem1/>
      <elem2/>
    </elements>
  </root>
</xml>
And what I mean with unformatted xml is something like:
<xml><root><elements><elem1/><elem2/></elements></root></xml>
or:
<xml>
<root>
<elements>
<elem1/>
<elem2/>
</elements>
</root>
</xml>
Is there a simple way to do this?

Strip all newline characters with String xml = IOUtils.toString(xmlIn).replace("\n", ""). Or strip \t instead, to keep the line breaks but drop the indentation.

If you are sure that the formatted xml looks like:
<xml>
  <root>
    <elements>
      <elem1/>
      <elem2/>
    </elements>
  </root>
</xml>
you can replace group 1 of every match of ^(\s*)< with "". This way, the text inside the xml won't be changed.
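As a rough illustration of the regex idea, the sketch below removes leading indentation per line and, as a second variant, collapses whitespace-only gaps between tags into nothing. The class name XmlUnformat and both method names are invented for this example, and the approach assumes the document has no comments, CDATA sections, or mixed content whose whitespace matters.

```java
// Hypothetical helper showing two regex-based ways to "unformat" XML.
public class XmlUnformat {

    // Drops spaces and tabs in front of a tag at the start of each line,
    // keeping the line breaks. (?m) makes ^ match at every line start.
    public static String stripIndent(String xml) {
        return xml.replaceAll("(?m)^[ \\t]+<", "<");
    }

    // Drops whitespace-only runs between a closing '>' and the next '<',
    // which puts the whole document on one line.
    public static String collapse(String xml) {
        return xml.replaceAll(">\\s+<", "><").trim();
    }

    public static void main(String[] args) {
        String formatted = "<xml>\n  <root>\n    <elem1/>\n  </root>\n</xml>";
        System.out.println(stripIndent(formatted));
        System.out.println(collapse(formatted));
    }
}
```

Text such as <a>  some text  </a> is left alone by both methods, because neither pattern matches whitespace that sits next to character data.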

An identity transformer with a parameter controlling the indent property, like so:
public static String getStringFromDocument(Document dom, boolean indented) {
String signedContent = null;
try {
StringWriter sw = new StringWriter();
DOMSource domSource = new DOMSource(dom);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
trans.transform(domSource, new StreamResult(sw));
sw.flush();
signedContent = sw.toString();
} catch (TransformerException e) {
e.printStackTrace();
}
return signedContent;
}
This works for me.
The key lies in this line:
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");

Try something like the following:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
new StreamSource(new StringReader(
"<xsl:stylesheet version=\"1.0\"" +
" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
"<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>" +
" <xsl:strip-space elements=\"*\"/>" +
" <xsl:template match=\"@*|node()\">" +
" <xsl:copy>" +
" <xsl:apply-templates select=\"@*|node()\"/>" +
" </xsl:copy>" +
" </xsl:template>" +
"</xsl:stylesheet>"
))
);
Source source = new StreamSource(new StringReader("xml string here"));
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
Instead of source being StreamSource in the second instance, it can also be DOMSource if you have an in-memory Document, if you want to modify the DOM before saving.
DOMSource source = new DOMSource(document);
To read an XML file into a Document object:
File file = new File("c:\\MyXMLFile.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
doc.getDocumentElement().normalize();
Enjoy :)

If you fancy trying your hand with JAXB then the marshaller has a handy property for setting whether to format (use new lines and indent) the output or not.
JAXBContext jc = JAXBContext.newInstance(packageName);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
m.marshal(element, outputStream);
Quite an overhead to get to that stage though... perhaps a good option if you already have a solid xsd

You can:
1) collapse all runs of consecutive whitespace (but leave single whitespace characters alone) and then replace every >(whitespace)< with ><
This is applicable only if the useful content does not contain runs of significant whitespace.
2) read it into a DOM tree and serialize it using a non-pretty serialization:
SAXReader reader = new SAXReader();
Reader r = new StringReader(data);
Document document = reader.read(r);
OutputFormat format = OutputFormat.createCompactFormat();
StringWriter sw = new StringWriter();
XMLWriter writer = new XMLWriter(sw, format);
writer.write(document);
String string = sw.toString();
3) use Canonicalization (but you must somehow explain to it that those whitespaces you want to remove are insignificant)

Kotlin.
Indentation usually follows a newline and consists of one or more spaces. So, to put everything on one line, replace each newline followed by one or more spaces with a single space:
xmlTag = xmlTag.replace("(\n +)".toRegex(), " ")

Remove the XML header from an XML in Java

StringWriter writer = new StringWriter();
XmlSerializer serializer = new KXmlSerializer();
serializer.setOutput(writer);
serializer.startDocument(null, null);
serializer.setFeature("http://xmlpull.org/v1/doc/features.html#indent-output", true);
// Creating XML
serializer.endDocument();
String xmlString = writer.toString();
In the above environment, are there any standard APIs available to remove the XML header <?xml version='1.0' ?>, or do you suggest going via string manipulation:
if (s.startsWith("<?xml ")) {
s = s.substring(s.indexOf("?>") + 2);
}
I want the output in xmlString without the XML header info <?xml version='1.0' ?>.

Ideally you can make an API call to exclude the XML header if desired. It doesn't appear that KXmlSerializer supports this though (skimming through the code here). If you had a org.w3c.dom.Document (or actually any other implementation of javax.xml.transform.Source) you could accomplish what you want this way:
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(doc), new StreamResult(writer));
Otherwise if you have to use KXmlSerializer it looks like you'll have to manipulate the output.

If you use a JAXP serializer you get access to all the output properties defined in XSLT, for example omit-xml-declaration="yes". You can get this in the form of an "identity transformer", obtained by calling transformerFactory.newTransformer() with no parameters, on which you then call setOutputProperty(). Another example:
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.setOutputProperty("omit-xml-declaration", "yes");

Don't call:
serializer.startDocument();
It adds the XML header. You still need to call:
serializer.endDocument();
otherwise your XML will come out as an empty String.
