Keep literal strings between elements xml

I am trying to parse a simple XML file. It looks like this
<?xml version="1.0" encoding="utf-8"?>
<resources xmlns:ns1="urn:oasis:names:tc:xliff:document:1.2">
<string name="action_settings">Settings</string>
<string name="app_name">Colatris Sample</string>
<string name="cdata"><![CDATA[<p>Text<p>]]></string>
<string name="content_description_sample">Something</string>
<string name="countdown"><xliff:g example="5 days" id="time">%1$s</xliff:g> until holiday</string>
</resources>
This is my parsing method:
List<CsString> extract(Document document) throws CsException {
List<CsString> csStrings = new ArrayList<>();
Element resources = document.getDocumentElement();
NodeList strings = resources.getElementsByTagName("string");
for (int i = 0; i < strings.getLength(); i++) {
Node string = strings.item(i);
csStrings.add(new CsString(string.getAttributes().getNamedItem("name").getNodeValue(), string.getTextContent()));
}
return csStrings;
}
I am building the passed Document with this method.
Document getDocument() throws CsException {
try {
Application application = core.getApplication();
AssetManager assetManager = application.getAssets();
InputStream inputStream = assetManager.open("colatris/values.xml");
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setIgnoringElementContentWhitespace(true);
DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse(inputStream);
} catch (IOException | ParserConfigurationException | SAXException e) {
throw new CsException("Unable to get parser");
}
}
Everything works except for the cdata and countdown elements. I want the literal content between the string tags, but the parser returns only the text inside the CDATA section and strips out the xliff tags:
String countdown = %1$s until holiday
String cdata = <p>Text<p>
I want the parsed strings to look like this so I can persist them literally. I need to be able to reconstruct XML down the road with the meta data in the correct places.
String countdown = <ns1:g example="5 days" id="time">%1$s</ns1:g> until holiday
String cdata = <![CDATA[<p>Text<p>]]>
Are there any configuration tricks for Document that keep the nodes between two elements as literal strings? For most users stripping CDATA makes sense, but I need to get around that.

The reason is that getTextContent() returns only the text of the string element. Instead, take the child node (or child nodes, depending on the exact layout of your files) and serialize them again using a javax.xml.transform.Transformer. The code would look something like:
NodeList list = document.getDocumentElement().getElementsByTagName("string");
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty("omit-xml-declaration", "yes");
for (int i = 0; i < list.getLength(); i++) {
Node node = list.item(i);
Node child = node.getFirstChild();
StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(child), new StreamResult(writer));
System.out.println(writer.toString()); // Do your list thing instead
}
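
Since the countdown string has more than one child (an element followed by text), here is a minimal sketch of the same idea that serializes every child rather than only the first. The class name InnerXml and the helper names innerXml and parse are made up for illustration, and the sample XML is a cut-down version of the question's file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class InnerXml {

    // Hypothetical helper: serializes each child of 'parent' in document
    // order and concatenates the results into one string.
    public static String innerXml(Node parent) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            transformer.transform(new DOMSource(child), new StreamResult(writer));
        }
        return writer.toString();
    }

    // Hypothetical helper: parses an XML string with the default DOM settings.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<resources><string name=\"countdown\">"
                + "<g id=\"time\">%1$s</g> until holiday</string></resources>");
        Node string = doc.getElementsByTagName("string").item(0);
        System.out.println(innerXml(string));
    }
}
```

Whether a CDATA section comes back out as CDATA depends on the parser keeping it as a separate node (the factory default, as long as setCoalescing(true) is not called) and on the serializer in use, so check that case against your own runtime.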

Related

Retrieve XML Element names with Java from unknown message format

I am parsing XML from lots of JMS messaging topics, so the structure of each message varies a lot and I'd like to make one general tool to parse them all.
To start, all I want to do is get the element names:
<gui-action>
<action>some action</action>
<params>
<param1>blue</param1>
<param2>tall</param2>
</params>
</gui-action>
I just want to retrieve the strings "gui-action", "action", "params", "param1", and "param2." Duplicates are just fine.
I've tried using org.w3c.dom.Node, Element, NodeLists and I'm not having much luck. I keep getting the element values, not the names.
private Element root;
private Document doc;
private NodeList nl;
//messageStr is passed in elsewhere in the code
//but is a string of the full XML message.
doc = xmlParse( messageStr );
root = doc.getDocumentElement();
nl = root.getChildNodes();
int size = nl.getLength();
for (int i=0; i<size; i++) {
log.info( nl.item(i).getNodeName() );
}
public Document xmlParse( String xml ){
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db;
InputSource is;
try {
//Using factory get an instance of document builder
db = dbf.newDocumentBuilder();
is = new InputSource(new StringReader( xml ) );
doc = db.parse( is );
} catch(ParserConfigurationException pce) {
pce.printStackTrace();
} catch(SAXException se) {
se.printStackTrace();
} catch(IOException ioe) {
ioe.printStackTrace();
}
return doc;
//parse using builder to get DOM representation of the XML file
}
My logged "parsed" XML looks like this:
#text
action
#text
params
#text

Figured it out. I was iterating over only the child nodes and not including the parent. So now I just filter out the #text nodes and include the parent. Derp.
log.info( root.getNodeName() );
for (int i=0; i<size; i++) {
String nodeName = nl.item(i).getNodeName();
// Compare string contents with equals(); != only compares references
if( !"#text".equals(nodeName) ) {
log.info( nodeName );
}
}
Now if anyone knows a way to get a NodeList of the entire document, that would be awesome.
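
One way to get every element in the document, nested ones included, is getElementsByTagName("*") on the Document itself. The sketch below assumes the message arrives as a String; the class name ElementNames and the method elementNames are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementNames {

    // Hypothetical helper: returns the names of all elements in document
    // order, starting with the root. Text nodes are never included because
    // getElementsByTagName only matches elements.
    public static List<String> elementNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList all = doc.getElementsByTagName("*");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            names.add(all.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<gui-action><action>some action</action><params>"
                + "<param1>blue</param1><param2>tall</param2></params></gui-action>";
        for (String name : elementNames(xml)) {
            System.out.println(name);
        }
    }
}
```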

Keep numeric character entity characters such as `&#10; &#13;` when parsing XML in Java

I am parsing XML that contains numeric character references such as (but not limited to) &#10; &#13; &#60; &#62; (line feed, carriage return, <, >) in Java. While parsing, I append the text content of nodes to a StringBuffer to later write it out to a text file.
However, these character references are resolved into newlines/whitespace when I write the String to a file or print it out.
How can I keep the original numeric character references when iterating over the nodes of an XML file in Java and storing the text content in a String?
Example of demo xml file:
<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">
<Field attributeWithChar="A string followed by special symbols &#13;&#10;" />
</ABCD>
Example Java code. It loads the XML, iterates over the nodes and collects the text content of each node in a StringBuilder. After the iteration is over, it writes the contents to the console and also to a file (but without the &#13;&#10; symbols).
What would be a way to keep these symbols when storing them in a String? Could you please help me? Thank you.
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
Document document = null;
DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
document = documentBuilder.parse(new File("path/to/demo.xml"));
StringBuilder sb = new StringBuilder();
NodeList nodeList = document.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
NamedNodeMap nnp = node.getAttributes();
for (int j = 0; j < nnp.getLength(); j++) {
sb.append(nnp.item(j).getTextContent());
}
}
}
System.out.println(sb.toString());
try (Writer writer = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("path/to/demo_output.xml"), "UTF-8"))) {
writer.write(sb.toString());
}
}

You need to escape all the character references before parsing the file into a Document. You do that by escaping the ampersand & itself with its corresponding XML entity &amp;. Something like,
DocumentBuilder documentBuilder =
DocumentBuilderFactory.newInstance().newDocumentBuilder();
String xmlContents = new String(Files.readAllBytes(Paths.get("demo.xml")), "UTF-8");
Document document = documentBuilder.parse(
new InputSource(new StringReader(xmlContents.replaceAll("&", "&amp;"))
));
Output :
2A string followed by special symbols &#13;&#10;
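
A self-contained sketch of that pre-escaping step, assuming the XML is already in memory as a String. The class name KeepCharRefs and the method attributeValue are made up for the example. Keep in mind that this escapes every ampersand, so named entities such as &lt; also survive as literal text instead of being decoded.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class KeepCharRefs {

    // Hypothetical helper: escapes every '&' before parsing, so character
    // references reach the DOM as literal text, then returns the named
    // attribute of the first element with the given tag name.
    public static String attributeValue(String xml, String tag, String attr) throws Exception {
        String escaped = xml.replace("&", "&amp;");
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(escaped)));
        Element element = (Element) doc.getElementsByTagName(tag).item(0);
        return element.getAttribute(attr);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ABCD version=\"2\"><Field attributeWithChar=\"A string&#13;&#10;\"/></ABCD>";
        // The references are printed as written instead of becoming CR and LF.
        System.out.println(attributeValue(xml, "Field", "attributeWithChar"));
    }
}
```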

P.S. This complements Ravi Thapliyal's answer; it is not an alternative.
I had the same problem handling an XML file exported from an Excel 2003-format spreadsheet. This XML file stores line breaks in text content as &#10;, along with other numeric character references. However, after reading it with the Java DOM parser, manipulating the content of some elements, and transforming it back to an XML file, I saw that all the numeric character references had been expanded (i.e. the line break was converted to CRLF) on Windows with J2SE 1.6. Since my goal is to keep the content format as unchanged as possible while manipulating some elements (i.e. retain the numeric character references), Ravi Thapliyal's suggestion seems to be the only working solution.
When writing the XML content back to the file, it is necessary to replace every &amp; with & again. To do that, I gave the transformer a StringWriter as its StreamResult, obtained the String from it, replaced all occurrences, and wrote the string to the XML file.
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);
//write into a stringWriter for further processing.
StringWriter stringWriter = new StringWriter();
StreamResult result = new StreamResult(stringWriter);
t.transform(source, result);
//stringWriter stream contains xml content.
String xmlContent = stringWriter.getBuffer().toString();
//revert "&amp;" back to "&" to retain numeric character references.
xmlContent = xmlContent.replaceAll("&amp;", "&");
BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
wr.write(xmlContent);
wr.close();

JDOM getChildren() returns empty list

This is my XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<StoreMessage xmlns="http://www.xxx.com/feed">
<billingDetail>
<billingDetailId>987</billingDetailId>
<contextId>0</contextId>
<userId>
<pan>F0F8DJH348DJ</pan>
<contractSerialNumber>46446</contractSerialNumber>
</userId>
<declaredVehicleClass>A</declaredVehicleClass>
</billingDetail>
<billingDetail>
<billingDetailId>543</billingDetailId>
<contextId>0</contextId>
<userId>
<pan>F0F854534534348DJ</pan>
<contractSerialNumber>4666546446</contractSerialNumber>
</userId>
<declaredVehicleClass>C</declaredVehicleClass>
</billingDetail>
</StoreMessage>
With the JDOM parser I want to get all <billingDetail> XML nodes from it.
My code:
SAXBuilder builder = new SAXBuilder();
try {
Reader in = new StringReader(xmlAsString);
Document document = (Document)builder.build(in);
Element rootNode = document.getRootElement();
List<?> list = rootNode.getChildren("billingDetail");
XMLOutputter outp = new XMLOutputter();
outp.setFormat(Format.getCompactFormat());
for (int i = 0; i < list.size(); i++) {
Element node = (Element)list.get(i);
StringWriter sw = new StringWriter();
outp.output(node.getContent(), sw);
StringBuffer sb = sw.getBuffer();
String text = sb.toString();
xmlRecords.add(sb.toString());
}
} catch (IOException io) {
io.printStackTrace();
} catch (JDOMException jdomex) {
jdomex.printStackTrace();
}
But I never get the XML node as a string in the output, like:
<billingDetail>
<billingDetailId>987</billingDetailId>
<contextId>0</contextId>
<userId>
<pan>F0F8DJH348DJ</pan>
<contractSerialNumber>46446</contractSerialNumber>
</userId>
<declaredVehicleClass>A</declaredVehicleClass>
</billingDetail>
What am I doing wrong? How can I get this output with the JDOM parser?
EDIT
And why does it work if the XML starts with
<StoreMessage> instead of <StoreMessage xmlns="http://www.xxx.com/MediationFeed">?
How is this possible?

The problem is that there are two versions of the getChildren method:
java.util.List getChildren(java.lang.String name)
This returns a List of all the child elements nested directly (one level deep) within this element with the given local name and belonging to no namespace, returned as Element objects.
and
java.util.List getChildren(java.lang.String name, Namespace ns)
This returns a List of all the child elements nested directly (one level deep) within this element with the given local name and belonging to the given Namespace, returned as Element objects.
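The first one doesn't find your nodes, because the default namespace declared on <StoreMessage> also applies to its children, so they belong to that namespace. Use the second one and pass the namespace, for example rootNode.getChildren("billingDetail", rootNode.getNamespace()). This also explains the edit: without the xmlns declaration the elements are in no namespace, so the one-argument version matches them.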

Adding namespace to an already created XML document

I am creating a W3C Document object using a String value. Once I created the Document object, I want to add a namespace to the root element of this document. Here's my current code:
Document document = builder.parse(new InputSource(new StringReader(xmlString)));
document.getDocumentElement().setAttributeNS("http://com", "xmlns:ns2", "Test");
document.setPrefix("ns2");
TransformerFactory tranFactory = TransformerFactory.newInstance();
Transformer aTransformer = tranFactory.newTransformer();
Source src = new DOMSource(document);
Result dest = new StreamResult(new File("c:\\xmlFileName.xml"));
aTransformer.transform(src, dest);
What I use as input:
<product>
<arg0>DDDDDD</arg0>
<arg1>DDDD</arg1>
</product>
What the output should look like:
<ns2:product xmlns:ns2="http://com">
<arg0>DDDDDD</arg0>
<arg1>DDDD</arg1>
</ns2:product>
I need to add the prefix value and namespace also to the input xml string. If I try the above code I am getting this exception:
NAMESPACE_ERR: An attempt is made to create or change an object in a way which is incorrect with regard to namespaces.
Appreciate your help!

Since there is not an easy way to rename the root element, we'll have to replace it with an element that has the correct namespace and attribute, and then copy all the original children into it. Forcing the namespace declaration is not needed because by giving the element the correct namespace (URI) and setting the prefix, the declaration will be automatic.
Replace the setAttributeNS and setPrefix calls (lines 2 and 3 of the question's code) with this:
String namespace = "http://com";
String prefix = "ns2";
// Upgrade the DOM level 1 to level 2 with the correct namespace
Element originalDocumentElement = document.getDocumentElement();
Element newDocumentElement = document.createElementNS(namespace, originalDocumentElement.getNodeName());
// Set the desired namespace and prefix
newDocumentElement.setPrefix(prefix);
// Copy all children
NodeList list = originalDocumentElement.getChildNodes();
while(list.getLength()!=0) {
newDocumentElement.appendChild(list.item(0));
}
// Replace the original element
document.replaceChild(newDocumentElement, originalDocumentElement);
In the original code the author tried to declare an element namespace like this:
.setAttributeNS("http://com", "xmlns:ns2", "Test");
The first parameter is the namespace of the attribute, and since it's a namespace declaration attribute it needs to have the http://www.w3.org/2000/xmlns/ URI. The declared namespace goes in the third parameter:
.setAttributeNS("http://www.w3.org/2000/xmlns/", "xmlns:ns2", "http://com");
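
As an alternative to copying the children into a new element, DOM Level 3 offers Document.renameNode, which can move the root into a namespace directly. The following is only a sketch under the assumption that the document is parsed namespace-aware; the class name RenameRoot and the method withNamespacedRoot are invented here, and how the xmlns declaration appears in serialized output can depend on the serializer.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class RenameRoot {

    // Hypothetical helper: parses the XML and renames the root element to
    // prefix:localName in the given namespace. Child elements are untouched.
    public static Document withNamespacedRoot(String xml, String uri, String prefix) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so getLocalName() is not null
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        doc.renameNode(root, uri, prefix + ":" + root.getLocalName());
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = withNamespacedRoot(
                "<product><arg0>DDDDDD</arg0><arg1>DDDD</arg1></product>", "http://com", "ns2");
        // Re-read the root, since renameNode may return a replacement node.
        Element root = doc.getDocumentElement();
        System.out.println(root.getNodeName() + " in " + root.getNamespaceURI());
    }
}
```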

The approach below also works for me, but it probably should not be used in performance-critical cases.
Add the namespace to the document root element as an attribute.
Transform the document to an XML string. The purpose of this step is to make the child elements in the XML string inherit the parent element's namespace.
Now the XML string has the namespace.
You can use the XML string to build a document again, or for JAXB unmarshalling, etc.
private static String addNamespaceToXml(InputStream in)
throws ParserConfigurationException, SAXException, IOException,
TransformerException {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
/*
* Must not be namespace aware, otherwise the generated XML string will
* have the wrong namespace
*/
// dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.parse(in);
Element documentElement = document.getDocumentElement();
// Add name space to root element as attribute
documentElement.setAttribute("xmlns", "http://you_name_space");
String xml = transformXmlNodeToXmlString(documentElement);
return xml;
}
private static String transformXmlNodeToXmlString(Node node)
throws TransformerException {
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
StringWriter buffer = new StringWriter();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new DOMSource(node), new StreamResult(buffer));
String xml = buffer.toString();
return xml;
}

Partially gleaned from here, and also from a comment above, I was able to get it to work (transforming an arbitrary DOM Node and adding a prefix to it and all its children) thus:
private void addNamespacePrefix(Document doc, Node node) throws TransformerException {
Element mainRootElement = doc.createElementNS(
"http://abc.de/x/y/z", // namespace
"my-prefix:fake-header-element" // prefix to "register" it with the DOM so we don't get exceptions later...
);
List<Element> descendants = nodeListToArrayRecurse(node.getChildNodes()); // for some reason we have to grab all these before doing the first "renameNode" ... no idea why ...
mainRootElement.appendChild(node);
doc.renameNode(node, "http://abc.de/x/y/z", "my-prefix:" + node.getNodeName());
descendants.stream().forEach(c -> doc.renameNode(c, "http://abc.de/x/y/z", "my-prefix:" + c.getNodeName()));
}
private List<Element> nodeListToArrayRecurse(NodeList entryNodes) {
List<Element> allEntries = new ArrayList<>();
for (int i = 0; i < entryNodes.getLength(); i++) {
Node child = entryNodes.item(i);
if (child.getNodeType() == Node.ELEMENT_NODE) {
allEntries.add((Element) child);
allEntries.addAll(nodeListToArrayRecurse(child.getChildNodes())); // recurse
} // ignore other [i.e. text] nodes https://stackoverflow.com/questions/14566596/loop-through-all-elements-in-xml-using-nodelist
}
return allEntries;
}
If it helps anybody. I then convert it to string, then manually remove the extra header and closing lines. What a pain, I must be doing something wrong...

This seems to be working for me, and it's much simpler than those answers provided:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
document = builder.parse(new File(filename));
document.getDocumentElement().setAttributeNS("http://www.w3.org/2000/xmlns/", "xmlns:yourNamespace", "http://whatever/else");

How do I extract child element from XML to a string in Java?

If I have an XML document like
<root>
<element1>
<child attr1="blah">
<child2>blahblah</child2>
</child>
</element1>
</root>
I want to get an XML string with the first child element. My output string would be
<element1>
<child attr1="blah">
<child2>blahblah</child2>
</child>
</element1>
There are many approaches, would like to see some ideas. I've been trying to use Java XML APIs for it, but it's not clear that there is a good way to do this.
thanks

You're right, with the standard XML API, there's not a good way - here's one example (may be bug ridden; it runs, but I wrote it a long time ago).
import javax.xml.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import org.w3c.dom.*;
import java.io.*;
public class Proc
{
public static void main(String[] args) throws Exception
{
//Parse the input document
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File("in.xml"));
//Set up the transformer to write the output string
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty("indent", "yes");
StringWriter sw = new StringWriter();
StreamResult result = new StreamResult(sw);
//Find the first child node - this could be done with xpath as well
NodeList nl = doc.getDocumentElement().getChildNodes();
DOMSource source = null;
for(int x = 0;x < nl.getLength();x++)
{
Node e = nl.item(x);
if(e instanceof Element)
{
source = new DOMSource(e);
break;
}
}
//Do the transformation and output
transformer.transform(source, result);
System.out.println(sw.toString());
}
}
It would seem like you could get the first child just by using doc.getDocumentElement().getFirstChild(), but the problem with that is if there is any whitespace between the root and the child element, that will create a Text node in the tree, and you'll get that node instead of the actual element node. The output from this program is:
D:\home\tmp\xml>java Proc
<?xml version="1.0" encoding="UTF-8"?>
<element1>
<child attr1="blah">
<child2>blahblah</child2>
</child>
</element1>
I think you can suppress the xml version string if you don't need it, but I'm not sure on that. I would probably try to use a third party XML library if at all possible.

Since this is the top Google result, for those of you who just want the basics:
public static String serializeXml(Element element) throws Exception
{
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
StreamResult result = new StreamResult(buffer);
DOMSource source = new DOMSource(element);
TransformerFactory.newInstance().newTransformer().transform(source, result);
// The transformer writes UTF-8 by default, so decode with the same charset
return new String(buffer.toByteArray(), "UTF-8");
}
I use this for debugging, which is most likely what you need it for.

I would recommend JDOM. It's a Java XML library that makes dealing with XML much easier than the standard W3C approach.

public String getXML(String xmlContent, String tagName){
String startTag = "<"+ tagName + ">";
String endTag = "</"+ tagName + ">";
int startposition = xmlContent.indexOf(startTag);
int endposition = xmlContent.indexOf(endTag, startposition);
if (startposition == -1){
return "ddd";
}
startposition += startTag.length();
if(endposition == -1){
return "eee";
}
return xmlContent.substring(startposition, endposition);
}
Pass your XML as a string to this method, and in your case pass "element1" as the tagName parameter. Note that it returns only the content between the tags, not the tags themselves.

XMLBeans is an easy to use (once you get the hang of it) tool to deal with XML without having to deal with the annoyances of parsing.
It requires that you have a schema for the XML file, but it also provides a tool to generate a schema from an existing XML file (depending on your needs, the generated one is probably fine).

If your xml has schema backing it, you could use xmlbeans or JAXB to generate pojo objects that help you marshal/unmarshal xml.
http://xmlbeans.apache.org/
https://jaxb.dev.java.net/

As the question is really about finding the first occurrence of one string inside another, I would use String class methods instead of XML parsers:
public static String getElementAsString(String xml, String tagName){
int beginIndex = xml.indexOf("<" + tagName);
int endIndex = xml.indexOf("</" + tagName, beginIndex) + tagName.length() + 3;
return xml.substring(beginIndex, endIndex);
}

You can use the following function to extract an XML block as a string by passing a suitable XPath expression:
private static String nodeToString(Node node) throws TransformerException
{
StringWriter buf = new StringWriter();
Transformer xform = TransformerFactory.newInstance().newTransformer();
xform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
xform.transform(new DOMSource(node), new StreamResult(buf));
return(buf.toString());
}
public static void main(String[] args) throws Exception
{
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(inputFile);
XPath xPath = XPathFactory.newInstance().newXPath();
Node result = (Node)xPath.evaluate("A/B/C", doc, XPathConstants.NODE); //"A/B[id = '1']" //"//*[@type='t1']"
System.out.println(nodeToString(result));
}
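
For a version that runs without an input file, here is a small sketch that parses the XML from a String and selects a node with an attribute predicate. The class name XPathExtract and the method extract are made up for illustration; the sample XML is the document from the question with its closing tags corrected.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathExtract {

    // Hypothetical helper: returns the first node matching the XPath
    // expression as an XML string, or null if nothing matches.
    public static String extract(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODE);
        if (node == null) {
            return null;
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><element1><child attr1=\"blah\">"
                + "<child2>blahblah</child2></child></element1></root>";
        System.out.println(extract(xml, "//child[@attr1='blah']/child2"));
    }
}
```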
