Transforming an XML attribute into a valid HTML attribute value? - java

I'm using Java 6 on a Tomcat 6.0.33 application server. I'm getting XML that I must render as a form element. The XML I receive looks like
<pquotec type='input' label='Price Quote Currency' nwidth='200' vlength='10'>
XYZ
</pquotec>
and below is the desired output.
<label for="#valid_id_goes_here#">Price Quote Currency</label>
<input type="text" size="10" style="width:200px;" value="XYZ" name="#valid_name_goes_here#" id="#valid_id_goes_here#" />
My question is, what is a strategy for transforming the value stored in the XML element's label attribute to something I can replace "#valid_name_goes_here#" above with? Preferably the strategy would allow me to translate back again. Note that things that appear within "" may not necessarily be suitable for values for id and name.
Thanks for your help, - Dave

The name attribute of the input element is defined as having type CDATA, which basically means "any character data", so I think there shouldn't really be a problem.
If you do encounter a validity issue, you could convert any 'awkward' (or simply all) characters to their encoded form. E.g. é would become &#233;.
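A minimal sketch of that encoding step, assuming numeric character references are acceptable and that only ASCII letters and digits count as "safe". The class and method names here are hypothetical, not from any library.

```java
class CharRefSketch {
    // Replace every character that is not an ASCII letter or digit
    // with a decimal numeric character reference such as &#233;.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128 && Character.isLetterOrDigit(cp)) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("caf\u00e9 au lait"));
    }
}
```

Because each reference carries the original code point, the encoding can be reversed by parsing the numbers back out.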

Use XSLT. Here's an example that converts XML to HTML, but it is trivial to convert XML to XML as well.
In Java, Xalan can do XSLT, and this thread might also help you.
In case you want to do the XML Parsing and render the target HTML using JSP refer to this thread for a list of XML Parsers
EDIT:
Hmmm, you could have written the question without XML & HTML fragments, and asked simply how to convert any string into a valid HTML Id, and back again.
Use HTML data- attributes to store the original incoming string. Then use a regex to keep the valid characters from the incoming string, replacing all invalid characters with an underscore, and use the result as the ID. There is a small chance that you may get duplicate IDs. In that case you can always go back and make the XML come in a way that does not have duplicates.
This way you can get back the original string and have valid IDs.
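The approach could be sketched roughly as follows. The class name, the `data-label` attribute name, and the `f_` prefix are all assumptions made for illustration; the set of characters treated as valid is a conservative guess, not a rule taken from the question.

```java
import java.util.regex.Pattern;

class HtmlIdSketch {
    // Anything outside this conservative set is replaced with an underscore.
    private static final Pattern INVALID = Pattern.compile("[^A-Za-z0-9_-]");

    static String toId(String label) {
        String id = INVALID.matcher(label).replaceAll("_");
        // Prefix when the result is empty or does not start with a letter.
        if (id.isEmpty() || !Character.isLetter(id.charAt(0))) {
            id = "f_" + id;
        }
        return id;
    }

    // Escape the characters that are unsafe inside a double-quoted attribute.
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String render(String label, String value) {
        String id = toId(label);
        return "<input type=\"text\" id=\"" + id + "\" name=\"" + id
                + "\" data-label=\"" + escapeAttr(label)
                + "\" value=\"" + escapeAttr(value) + "\" />";
    }

    public static void main(String[] args) {
        System.out.println(render("Price Quote Currency", "XYZ"));
    }
}
```

The sanitized ID is lossy (a space and a slash both become an underscore), which is why the original label travels alongside it in the data- attribute instead of being reconstructed from the ID.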

Related

In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the different metadata items are a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching&#44; evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able -- as far as I'm aware -- to differentiate the encoded comma (from the ones that aren't encoded) in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
When you use dom4j or DOM all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a more low-level interface and has support via its LexicalHandler interface to get notified when the parser encounters entity references, but it does not report character references. So it seems that you would really need to write your own parser, or patch an existing one.
But in the end it would be best if you can change the schema of your document:
<tags>
<tag>Research skills</tag>
<tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
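With the restructured document, reading the tags no longer depends on how commas are encoded. A minimal sketch using the JDK's built-in DOM parser (rather than dom4j, so that it is self-contained); the class and method names are hypothetical:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagListSketch {
    // Collect the text of every <tag> element; commas inside a tag are just text.
    static List<String> readTags(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("tag");
        List<String> tags = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            tags.add(nodes.item(i).getTextContent().trim());
        }
        return tags;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tags><tag>Research skills</tag>"
                + "<tag>Searching, evaluating and referencing</tag></tags>";
        for (String tag : readTags(xml)) {
            System.out.println(tag);
        }
    }
}
```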
Using LexEv from http://andrewjwelch.com/lexev/, and putting xercesImpl.jar from Apache Xerces on the class path, I am able to compile and run a short sample using dom4j:
LexEv lexEv = new LexEv();
SAXReader reader = new SAXReader(lexEv);
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If the input1.xml has your sample XML snippet, then the output is
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.
As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
With vtd-xml you can distinguish a character from its entity-encoded counterpart by using VTDNav's toRawString() method.

xml error Element or attribute do not match QName production: QName::=(NCName':')?NCName

Here is my XML:
As suggested by many posts on resolving this error, I checked the closing tags, but maybe I am still missing something. Please help.
Also this is the SOAP request XML which I am taking as source to create the above xml:
Why am I doing this? I need to validate the request XML with an inline schema from a WSDL file, so I extracted the inline schema and created an XSD file. Now I need a request XML to validate against my already created XSD file.
Welcome to Stack Overflow, and to the use of XML.
You don't say exactly what the question is, but I guess it's something along the lines of "what is going wrong here?" or "what is wrong with this XML?"
The data you show is not XML, because it's not well-formed. Consider the string <xmlns:ejb3="http://ejb3.examples.itko.com/">. I guess this is intended to be a start-tag.
In XML, a start-tag begins and ends with angle brackets, and within those has an element type name followed by zero or more attribute-value specifications or namespace declarations, separated from each other and from the element type name by whitespace.
If we take the string xmlns:ejb3="http://ejb3.examples.itko.com/" as a namespace declaration (as it is, in the source from which you say you copied your data), then your problem is that your start-tag does not give any element type name. (And a secondary problem is that the string </xmlns:ejb3> at the end of the data stream looks like it's trying to be an end-tag, but it's using a namespace-attribute name where it needs to be using an element type name.)
If on the other hand we take <xmlns:ejb3 as an angle bracket followed by an element type name, your problems are that (a) the element type name begins with the reserved string 'xml', which is not allowed by the XML spec, and (b) the element type name is followed not by a closing angle bracket or by a blank and an attribute-value specification, but by an equals sign and a quoted string -- it looks like a fragmentary attribute-value specification lacking the attribute name.
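A quick way to check such data before going further is to run it through a parser and see whether it is accepted. The sketch below uses the JDK's built-in parser; the class name is hypothetical, and the element name ejb3:add in the second input is invented purely to give the namespace declaration an element to sit on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedSketch {
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Start-tag with no element type name: rejected.
        System.out.println(isWellFormed(
                "<xmlns:ejb3=\"http://ejb3.examples.itko.com/\"></xmlns:ejb3>"));
        // Hypothetical element name followed by the namespace declaration: accepted.
        System.out.println(isWellFormed(
                "<ejb3:add xmlns:ejb3=\"http://ejb3.examples.itko.com/\"></ejb3:add>"));
    }
}
```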
Five minutes with a tutorial on the basics of XML should enable you to avoid such problems. XML syntax is quite simple, compared to a lot of alternatives, but it is not something anyone is born knowing, and your experience shows that you are not managing to pick it up from examples without spending any time at all actually studying it.

Data type for a HTML Page in JAVA

Which data type should be used to store an HTML page in JAVA?
Recommendation: Store the page in a (Jsoup) Document.
pros:
you can parse those documents from string / file / website in a single line
all entities are escaped (and can be unescaped)
pretty printing
you get a string out of it with a single line of code - a html string as well as a text only one
you can easily select / modify your html
html is "cleaned"
...
see: http://jsoup.org/
But some more information about what you want to do would be helpful ...
Without knowing what you will do with it, I'd suggest a
java.lang.String
because that's what it actually is. A character string.

Escaping an xml string in java

I read elements with CDATA sections from an RSS feed which I need to convert to valid XML. The content in the CDATA section is mostly valid XHTML, but sometimes characters like ampersand appear in attributes (URLs).
I can use .replaceAll("&", "&amp;") to solve this, but thinking a bit forward, it may be that other invalid characters show up in attributes or text.
The CMS to which I'm importing the element, won't accept CDATA sections without setting up another configuration for the content, so my question is: is there any simple way to escape the string, only for attributes and text?
I'm using the jdom library to manipulate the xml after the import.
Edit: I've checked out apache's StringEscapeUtils, but this is escaping the whole string. I need something that will only escape attribute values and text inside elements.
Apache Commons provides handy functions for this: StringEscapeUtils
When you use JDOM it will automatically and correctly escape any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS...?
In essence, if you have valid XML input, and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output.... so, what are you doing to have broken output?
Rolf
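To illustrate the point that a serializer escapes content for you, here is a minimal sketch. It uses the JDK's built-in DOM and Transformer classes instead of JDOM so that it runs without extra jars; the assumption is that the same principle carries over, since values set through the API are plain strings and escaping happens only on output. The class and method names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SerializerEscapingSketch {
    // Build <a href="...">text</a> from raw strings and serialize it.
    static String build(String href, String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element a = doc.createElement("a");
        a.setAttribute("href", href); // raw value, no manual escaping
        a.setTextContent(text);       // raw value, no manual escaping
        doc.appendChild(a);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The ampersands in both the attribute and the text come out as &amp;.
        System.out.println(build("http://example.com/?a=1&b=2", "Tom & Jerry"));
    }
}
```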

Regular expression for removing HTML tags from a string

I am looking for a regular expression to remove all HTML tags from a string in JSP.
Example 1
sampleString = "test string <i>in italics</i> continues";
Example 2
sampleString = "test string <i>in italics";
Example 3
sampleString = "test string <i";
The HTML tag might be complete, partial (without closing tag) or without proper starting tag (missing closing angle bracket in 3rd example) itself.
Thanks in advance
Case 3 is not possible with regex or a parser. It might represent legitimate content. So forget it.
As to the concrete question which covers cases 1 and 2, just use a HTML parser. My favourite is Jsoup.
String text = Jsoup.parse(html).text();
That's it. It has by the way also a HTML cleaner, if that is what you're actually after.
Since you're using JSP, you could also just use JSTL <c:out> or fn:escapeXml() to prevent user-controlled HTML input from being inlined among your HTML (which may open XSS holes).
<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
HTML tags will then not be interpreted, but just displayed as plain text.
<\/?font(\s\w+(\=\".*\")?)*\>
I used this little gem about a week ago to strip a variety of 12-year-old html tags, and it worked pretty great. Just replace 'font' with whatever tag you're looking for, or with \w* to get rid of all of them.
Edit: removed '?' from the end of my string after realizing that it could remove non-tag data from a file. Basically, this will consistently find cases 1 and 2, but if used with case 3 (with the '?' appended to the end of the regex), caution should be used to ensure that what is removed is a tag.
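For reference, a small Java sketch of this regex with 'font' swapped for \w*, run against the three examples from the question. The class and method names are hypothetical. The second pattern is my own variation, not part of the answer above: it replaces the greedy .* with [^"]* because the greedy form can swallow text between two quoted attributes on the same line.

```java
import java.util.regex.Pattern;

class StripTagsSketch {
    // The regex from the answer, with 'font' replaced by \w*.
    static final Pattern GREEDY =
            Pattern.compile("</?\\w*(\\s\\w+(=\".*\")?)*>");
    // Variation (an assumption, not from the answer): stop at the closing quote.
    static final Pattern BOUNDED =
            Pattern.compile("</?\\w*(\\s\\w+(=\"[^\"]*\")?)*>");

    static String strip(Pattern p, String html) {
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip(GREEDY, "test string <i>in italics</i> continues"));
        System.out.println(strip(GREEDY, "test string <i>in italics"));
        System.out.println(strip(GREEDY, "test string <i")); // case 3 is left alone
        String two = "<a href=\"x\">link</a> <b title=\"y\">bold</b>";
        System.out.println(strip(GREEDY, two));  // loses the word "link"
        System.out.println(strip(BOUNDED, two)); // keeps both words
    }
}
```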
