DOM parse xml file without converting entity references - java

I am writing a parser for an XML file that will contain special characters, for example
<name>You &amp; me &reg;</name>
The DOM parser will, by default, parse this value to "You & me ®".
However, the string I want is
You &amp; me &reg;
Is there any way I can do this?
Thanks

If you are using DOM for parsing, see the DocumentBuilderFactory.setExpandEntityReferences() method.
By default, this setting is true, meaning that entities are expanded automatically. If you turn it off, you can read the entity references from the DOM: instead of one big text node under a parent element, you get text nodes interleaved with entity reference nodes.
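A minimal sketch of that setting, using only the JDK's built-in parser. The class name EntityRefDemo and the inline DTD that declares the reg entity are assumptions made for illustration (the question's file would need its own declaration for &reg;). As far as I know, the predefined entities such as &amp; are still delivered as plain text even with expansion turned off, and the exact child layout of the entity reference node varies between JDK versions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class EntityRefDemo {

    // Parses the XML with entity expansion turned off and describes
    // each direct child of the root element.
    static List<String> describeChildren(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setExpandEntityReferences(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> parts = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                // The node name of an entity reference is the entity name.
                parts.add("entity:&" + n.getNodeName() + ";");
            } else if (n.getNodeType() == Node.TEXT_NODE) {
                parts.add("text:" + n.getNodeValue());
            }
        }
        return parts;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: the DTD is only here so that &reg; is declared.
        String xml = "<!DOCTYPE name [<!ENTITY reg \"&#174;\">]>"
                   + "<name>You &amp; me &reg;</name>";
        for (String part : describeChildren(xml)) {
            System.out.println(part);
        }
    }
}
```

Rebuilding the original string would then mean walking the children and writing "&" + name + ";" for each entity reference node.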

How to replace a section of a string in document with another string while parsing a document

I have a PDF document that I parsed using a Java library.
The problem is that the tables in the document are not parsed properly; they are parsed as plain text, line by line. So I used a Python library called Camelot, which gave me the tables in a parsed format, and I sent this to Java. I need to replace the tables in the Java-parsed PDF text with the ones from Camelot and keep the rest intact. There are multiple tables in a document, so the parsed tables are returned as a list of strings, with each index holding the parsed value of one table.
The boundaryEND tag represents the end of each table in the attached image of the Camelot output.
I tried using streams by calling the allMatch() method but couldn't replace the section, since allMatch() returns a boolean (it only indicates whether the strings match and does not return the elements themselves).
[Images: the output from Camelot; the Java-parsed PDF]
This could be done with the Stream API using a custom collector. See: Split a list into sublists based on a condition with Stream api
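As a rough illustration of the splitting step only, here is a sketch that uses a plain loop rather than the custom collector mentioned above. The class name TableSplitter and the assumption that boundaryEND sits on a line of its own are mine; the real Camelot output may be shaped differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableSplitter {

    // Splits the lines into sublists, closing a sublist at every marker line.
    // The marker line itself is dropped.
    static List<List<String>> splitOnMarker(List<String> lines, String marker) {
        List<List<String>> tables = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().equals(marker)) {
                tables.add(current);
                current = new ArrayList<>();
            } else {
                current.add(line);
            }
        }
        // Keep trailing lines that were not closed by a marker.
        if (!current.isEmpty()) {
            tables.add(current);
        }
        return tables;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a | b", "1 | 2", "boundaryEND", "c | d", "boundaryEND");
        System.out.println(splitOnMarker(lines, "boundaryEND"));
    }
}
```

Each resulting sublist could then be matched against the corresponding run of lines in the Java-parsed text and swapped in.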

In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the different metadata items is a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching&#44; evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able -- as far as I'm aware -- to differentiate the encoded comma (from the ones that aren't encoded) in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
When you use dom4j or DOM all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a lower-level interface and, via its LexicalHandler interface, can notify you when the parser encounters entity references, but it does not report character references. So it seems that you would really need to write your own parser, or patch an existing one.
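A small sketch of that LexicalHandler behaviour with the JDK's default SAX parser. The class name EntityNameLogger and the sep entity in the sample input are made up for the demonstration; the point is that a named entity reference triggers startEntity, while a character reference such as &#44; does not under the parser's default settings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

public class EntityNameLogger {

    // Returns the names passed to LexicalHandler.startEntity while parsing.
    static List<String> entityNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                names.add(name);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Standard SAX2 property for registering a LexicalHandler.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: "sep" is declared only to contrast with &#44;.
        String xml = "<!DOCTYPE tags [<!ENTITY sep \",\">]>"
                   + "<tags>Research skills, Searching&#44; evaluating &sep; referencing</tags>";
        System.out.println(entityNames(xml));
    }
}
```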
But in the end it would be best if you can change the schema of your document:
<tags>
<tag>Research skills</tag>
<tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
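For illustration, reading the restructured document needs no special handling of commas at all. This sketch uses the JDK's DOM parser rather than dom4j so that it stays self-contained; the class name TagListReader is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagListReader {

    // Collects the text of every <tag> element, one list entry per tag.
    static List<String> readTags(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("tag");
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            tags.add(nodes.item(i).getTextContent().trim());
        }
        return tags;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tags>"
                   + "<tag>Research skills</tag>"
                   + "<tag>Searching, evaluating and referencing</tag>"
                   + "</tags>";
        for (String tag : readTags(xml)) {
            System.out.println(tag);
        }
    }
}
```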
Using LexEv from http://andrewjwelch.com/lexev/, with xercesImpl.jar from Apache Xerces on the class path, I was able to compile and run a short sample using dom4j:
LexEv lexEv = new LexEv();
SAXReader reader = new SAXReader(lexEv);
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If the input1.xml has your sample XML snippet, then the output is
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.
As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
You can only distinguish a character from its entity-encoded counterpart with vtd-xml, by using VTDNav's toRawString() method.

Escaping an xml string in java

I read elements with CDATA sections from an RSS feed, which I need to convert to valid XML. The content in the CDATA section is mostly valid XHTML, but sometimes characters like ampersands appear in attributes (URLs).
I can use .replaceAll("&", "&amp;") to solve this, but thinking a bit ahead, other invalid characters may show up in attributes or text.
The CMS to which I'm importing the element, won't accept CDATA sections without setting up another configuration for the content, so my question is: is there any simple way to escape the string, only for attributes and text?
I'm using the jdom library to manipulate the xml after the import.
Edit: I've checked out apache's StringEscapeUtils, but this is escaping the whole string. I need something that will only escape attribute values and text inside elements.
Apache Commons provides handy functions for this: StringEscapeUtils
When you use JDOM, it will automatically and correctly escape any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS?
In essence, if you have valid XML input and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output. So, what are you doing to get broken output?
Rolf

Transforming an XML attribute into a valid HTML attribute value?

I'm using Java 6 on a Tomcat 6.0.33 application server. I'm getting XML that I must render as a form element. The XML I receive looks like
<pquotec type='input' label='Price Quote Currency' nwidth='200' vlength='10'>
XYZ
</pquotec>
and below is the desired output.
<label for="#valid_id_goes_here">Price Quote Currency</label>
<input type="text" size="10" style="width:200px;" value="XYZ" name="#valid_name_goes_here#" id="#valid_id_goes_here#" />
My question is, what is a strategy for transforming the value stored in the XML element's label attribute to something I can replace "#valid_name_goes_here#" above with? Preferably the strategy would allow me to translate back again. Note that things that appear within "" may not necessarily be suitable for values for id and name.
Thanks for your help, - Dave
The name attribute of the input element is defined as having type CDATA, which basically means "any character data", so I think there shouldn't really be a problem.
If you do encounter a validity issue, you could convert any 'awkward' (or simply all) characters to their encoded form. E.g. é would become &eacute;.
Use XSLT. Here's an example that converts XML to HTML, but converting XML to XML is just as simple.
In Java, Xalan can do XSLT, and this thread might also help you.
If you want to do the XML parsing and render the target HTML using JSP, refer to this thread for a list of XML parsers.
EDIT:
Hmmm, you could have written the question without XML & HTML fragments, and asked simply how to convert any string into a valid HTML Id, and back again.
Use HTML data- attributes to store the original incoming string. Then use a regex to keep the valid characters of the incoming string, replacing every invalid character with an underscore, and use the result as the ID. There is a small chance that you will get duplicate IDs. In that case you can always go back and change the incoming XML so that it does not produce duplicates.
This way you can get back the original string and still have valid IDs.
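A rough sketch of that idea. The names HtmlIdMapper, toHtmlId, escapeAttr and renderInput, the data-label attribute name, and the "f_" prefix for IDs that would not start with a letter are all illustrative choices, not something the question prescribes.

```java
public class HtmlIdMapper {

    // Replaces every character outside [A-Za-z0-9_-] with an underscore.
    // A prefix is added when the result would not start with a letter.
    static String toHtmlId(String label) {
        String id = label.replaceAll("[^A-Za-z0-9_-]", "_");
        if (id.isEmpty() || !Character.isLetter(id.charAt(0))) {
            id = "f_" + id;
        }
        return id;
    }

    // Minimal escaping for a double-quoted HTML attribute value.
    static String escapeAttr(String value) {
        return value.replace("&", "&amp;")
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    // Keeps the original label in a data- attribute so it can be read back.
    static String renderInput(String label, String value) {
        String id = toHtmlId(label);
        return "<input type=\"text\" id=\"" + id + "\" name=\"" + id + "\""
             + " data-label=\"" + escapeAttr(label) + "\""
             + " value=\"" + escapeAttr(value) + "\" />";
    }

    public static void main(String[] args) {
        System.out.println(renderInput("Price Quote Currency", "XYZ"));
    }
}
```

The mapping to an ID is lossy on its own (different labels can collapse to the same ID), which is why the original string travels along in the data- attribute.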

What does Java Node normalize method do?

I'm doing some tests, but I see no difference when I use or not the normalize() method.
But the examples at ExampleDepot website use it.
So, what is it for? (The documentation wasn't clear for me either)
You can programmatically build a DOM tree that has extraneous structure not corresponding to actual XML structures - specifically things like multiple nodes of type text next to each other, or empty nodes of type text. The normalize() method removes these, i.e. it combines adjacent text nodes and removes empty ones.
This can be useful when you have other code that expects DOM trees to always look like something built from an actual XML document.
This basically means that the following XML element
<foo>Hello
wor
ld</foo>
could be represented like this in a denormalized node:
Element foo
    Text node: ""
    Text node: "Hello "
    Text node: "wor"
    Text node: "ld"
When normalized, the node will look like this:
Element foo
    Text node: "Hello world"
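The same effect can be reproduced with the JDK's DOM implementation. This is a small sketch (the class name NormalizeDemo is an assumption) that builds the denormalized element by hand and compares the child count before and after normalize().

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeDemo {

    // Builds <foo> with an empty text node followed by three adjacent text nodes.
    static Element buildDenormalized() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element foo = doc.createElement("foo");
        doc.appendChild(foo);
        foo.appendChild(doc.createTextNode(""));
        foo.appendChild(doc.createTextNode("Hello "));
        foo.appendChild(doc.createTextNode("wor"));
        foo.appendChild(doc.createTextNode("ld"));
        return foo;
    }

    public static void main(String[] args) throws Exception {
        Element foo = buildDenormalized();
        System.out.println("children before: " + foo.getChildNodes().getLength());
        // Merges adjacent text nodes and drops empty ones in this subtree.
        foo.normalize();
        System.out.println("children after: " + foo.getChildNodes().getLength());
        System.out.println("text: " + foo.getTextContent());
    }
}
```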
It cleans up the tree by merging adjacent Text nodes and removing empty Text nodes.
There are a lot of possible DOM trees that correspond to the same XML structure, and each XML structure has at least one corresponding DOM tree. So conversion from DOM to XML is surjective.
So it may happen that:
dom_tree_1 != dom_tree_2
# but:
dom_tree_1.save_DOM_as_XML() == dom_tree_2.save_DOM_as_XML()
And there is no way to ensure:
dom_tree == dom_tree.save_DOM_as_XML().load_DOM_from_XML()
But we would like to have it bijective. That means each XML structure corresponds to one particular DOM tree.
So you can define a subset of all possible DOM trees that is bijective to the set of all possible XML structures.
# still:
dom_tree.save_DOM_as_XML() == dom_tree.normalize().save_DOM_as_XML()
# but with:
dom_tree_n = dom_tree.normalize()
# we now even have:
dom_tree_n == dom_tree_n.save_DOM_as_XML().load_DOM_from_XML().normalize()
So normalized DOM trees can be perfectly reconstructed from their XML representation. There is no information loss.
Normalize the root element of the XML document. This ensures that all Text nodes under the root node are put into a "normal" form, which means that there are neither adjacent Text nodes nor empty Text nodes in the document.
