Reading XML document nodes containing special characters (&, -, etc) with Java

My code does not retrieve the entirety of element nodes that contain special characters.
For example, for this node:
<theaterName>P&G Greenbelt</theaterName>
It would only retrieve "P" due to the ampersand. I need to retrieve the entire string.
Here's my code:
public List<String> findTheaters() {
//Clear theaters application global
FilmhopperActivity.tData.clearTheaters();
ArrayList<String> theaters = new ArrayList<String>();
NodeList theaterNodes = doc.getElementsByTagName("theaterName");
for (int i = 0; i < theaterNodes.getLength(); i++) {
Node node = theaterNodes.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
//Found theater, add to return array
Element element = (Element) node;
NodeList children = element.getChildNodes();
String name = children.item(0).getNodeValue();
theaters.add(name);
//Logging
android.util.Log.i("MoviefoneFetcher", "Theater found: " + name);
//Add theater to application global
Theater t = new Theater(name);
FilmhopperActivity.tData.addTheater(t);
}
}
return theaters;
}
I tried adding code to extend the name string to concatenate additional children.items, but it didn't work. I'd only get "P&".
...
String name = children.item(0).getNodeValue();
for (int j = 1; j < children.getLength() - 1; j++) {
name += children.item(j).getNodeValue();
}
Thanks for your time.
UPDATE:
Found a function called normalize() that you can call on Nodes, that combines all text child nodes so doing a children.item(0) contains the text of all the children, including ampersands!

The & is an escape character in XML. XML that looks like this:
<theaterName>P&G Greenbelt</theaterName>
should actually be rejected by the parser. Instead, it should look like this:
<theaterName>P&amp;G Greenbelt</theaterName>
There are a few such characters, such as < (&lt;), > (&gt;), " (&quot;) and ' (&apos;). There are also other ways to escape characters, such as via their Unicode value, as in &#x2022; or &#12345;.
For more information, the XML specification is fairly clear.
Now, the other thing it might be, depending on how your tree was constructed, is that the character is escaped properly, and the sample you showed isn't what's actually there, and it's how the data is represented in the tree.
For example, when using SAX to build a tree, entities (the &-thingies) are broken apart and delivered separately. This is because the SAX parser tries to return contiguous chunks of data, and when it gets to the escape character, it sends what it has, and starts a new chunk with the translated &-value. So you might need to combine consecutive text nodes in your tree to get the whole value.
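Combining the text nodes can be sketched roughly as follows. This assumes the source XML escapes the ampersand as &amp; and that a DOM Level 3 implementation is available; the class and method names are made up for illustration. Element.getTextContent() concatenates all descendant text, and normalize() merges adjacent text nodes first, as the update in the question describes. Note also that the loop in the question stops at children.getLength() - 1, so it skips the last child, which would explain getting only "P&".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the original code.
final class TheaterNameReader {

    // Returns the complete text of the first <theaterName> element,
    // regardless of how many text nodes the parser split it into.
    static String firstTheaterName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Merge adjacent text nodes so each element has at most one text child.
            doc.normalize();
            NodeList nodes = doc.getElementsByTagName("theaterName");
            Element element = (Element) nodes.item(0);
            // Concatenates the text of all descendants of the element.
            return element.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<theaters><theaterName>P&amp;G Greenbelt</theaterName></theaters>";
        System.out.println(firstTheaterName(xml));
    }
}
```

Either call on its own should be enough here; using both just makes the intent explicit.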

The file you are trying to read is not valid XML. No self-respecting XML parser will accept it.
I'm retrieving my XML dynamically from the web. What's the best way to replace all my escape characters after fetching the Document object?
You are taking the wrong approach. The correct approach is to inform the people responsible for creating that file that it is invalid, and request that they fix it. Simply writing hacks to (try to) fix broken XML is not in your (or other peoples') long term interest.
If you decide to ignore this advice, then one approach is to read the file into a String, use String.replaceAll(regex, replacement) with a suitable regex to turn these bogus "&" characters into proper character entities ("&amp;"), then pass the "fixed" XML string to the XML parser. You need to carefully design the regex so that it doesn't break valid character entities as an unwanted side-effect. A second approach is to do the parsing and replacement by hand, using appropriate heuristics to distinguish the bogus "&" characters from well-formed character entities.
But this all costs you development and test time, and slows down your software. Worse, there is a significant risk that your code will be fragile as a result of your efforts to compensate for the bad input files. (And guess who will get the blame!)
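If you do go down the regex route, a sketch might look like the one below. The pattern and the class name are my own assumptions, not something from a library: it treats an "&" as bogus unless it starts something shaped like a named entity (&name;), a decimal reference (&#38;) or a hex reference (&#x26;).

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regex workaround; a heuristic, not a parser.
final class BareAmpersandFixer {

    // Matches '&' that is NOT followed by "name;", "#digits;" or "#xhex;".
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escapes bare ampersands and leaves existing references untouched.
    static String escapeBareAmpersands(String rawXml) {
        return BARE_AMPERSAND.matcher(rawXml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("<theaterName>P&G Greenbelt</theaterName>"));
    }
}
```

This is exactly the kind of fragile heuristic warned about above: text such as "AT&T; rest" looks like an entity reference and is left alone, and ampersands inside CDATA sections (where they are legal) get rewritten as well.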

You need to either encode it properly or wrap it in a CDATA section. I'd recommend the former.

The numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.
All XML processors MUST recognize these entities whether they are declared or not. For interoperability, valid XML documents SHOULD declare these entities, like any others, before using them. If the entities lt or amp are declared, they MUST be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is REQUIRED for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they MUST be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is OPTIONAL but harmless). For example:
<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">

Related

Why am I getting no spaces between text node values?

I am using Xpath expression to get text nodes from a XML document like below:
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
<proj>
<under>E01</under>
<under>E02</under>
</proj>
<name>John Doe</name>
<gender>male</gender>
</emp>
</company>
I have written the following XPATH expression to get the text values :
normalize-space(string(//emp))
It is extracting the correct values and the output is like below:
Acct1000E01E02John Doemale
Notice that there are no spaces between the text node values from different nodes.
I actually want the output value to be in this way:
`Acct 1000 E01 E02 John Doe`
I have used javax.xml.xpath to parse and build the tree as follows:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(new File("/employees.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "normalize-space(string(//emp))";
String output = (String) xpath.compile(expression).evaluate(document, XPathConstants.STRING);
I am using Java SE 10 here, so the XPath version is 1.0.
Is there a better way to extract the text values?
I am pretty new to XPath so any suggestions would be helpful.

You are almost right here.
Picking the not operator is the right way to go.
It should be something like this:
/html/body/company/emp/*[not(self::gender)]
That is, all child nodes of emp except the gender node.
Here is a full example in JavaScript:
let xpathExpression = '/html/body/company/emp/*[not(self::gender)]';
let contextNode = window.document;
let xpathResult = document.evaluate(xpathExpression, contextNode,
null, XPathResult.ANY_TYPE, null);
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());

Oh dear, this one is complicated...
First of all, you haven't tagged your question with an XPath version. Usually people who aren't aware of XPath versions are using the ancient version 1.0, so I'll make that assumption: sorry if it's wrong.
In XPath 1.0, a function that is given a node-set and that expects a string uses the string value of the first node in the node-set, taken in document order.
In your query
normalize-space(string(//emp))
//emp selects a node-set, which happens to contain a single node, so string() takes the string value of that node. The string value of an element node is the concatenation of all its descendant text nodes. The normalize-space function removes leading and trailing whitespace, and normalizes internal space to a single whitespace character.
You have shown your XML in indented form as
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
etc, so it's reasonable to expect that the whitespace between elements forms part of the string value of the <emp> element. But you haven't told us how the document was parsed and turned into a node tree. Parsers often provide multiple options on how to do this, in particular, on how to handle the whitespace between element nodes. Most retain the whitespace by default, unless perhaps there is a schema or DTD that tells the parser that the whitespace is insignificant. Microsoft's MSXML parser, notoriously, drops the whitespace by default, which causes considerable problems when you are using XML to represent narrative documents, but actually makes life easier for people using XML for this kind of non-document data.
Your parser, for one reason or another (we can't tell) seems to have deleted the whitespace between element nodes. No XPath query is going to bring it back again. You may have options when building the document to retain the whitespace; that depends on the tools you are using.
Your second question asks about dropping one of the elements in the input. That's beyond the scope of XPath. XPath can only select nodes from the input, it cannot modify them in any way. To modify the tree, you need XSLT or XQuery.
Your attempt to solve the problem with //emp[not(descendant::gender)] is hopelessly doomed because this will only select employees that have no descendant element named gender. You appear to be guessing the semantics rather than using a specification or tutorial.
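If you are calling XPath from Java anyway, one way around both problems is to let XPath do only the selecting and do the joining in the host language. The sketch below is an assumption on my part about what you want (the class and method names are invented): it selects the non-blank text nodes under emp that are not inside gender, then joins them with a single space, so it does not matter whether the parser kept the whitespace between elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: XPath 1.0 selects the nodes, Java builds the string.
final class EmpTextJoiner {

    static String joinEmpText(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Non-blank text nodes below emp, skipping anything inside gender.
            String expression = "//emp//text()[normalize-space() and not(ancestor::gender)]";
            NodeList textNodes =
                    (NodeList) xpath.evaluate(expression, document, XPathConstants.NODESET);
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < textNodes.getLength(); i++) {
                parts.add(textNodes.item(i).getNodeValue().trim());
            }
            return String.join(" ", parts);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<company><emp><dept>Acct</dept><salary>1000</salary>"
                + "<proj><under>E01</under><under>E02</under></proj>"
                + "<name>John Doe</name><gender>male</gender></emp></company>";
        System.out.println(joinEmpText(xml));
    }
}
```

Note that the tree itself is not modified; the gender text is simply never selected.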

Can the Java streaming XML parser distinguish empty element from self-closing empty element?

Can the Java streaming XML parser, i.e. javax.xml.stream.XMLEventReader distinguish an empty element
<document>
<empty></empty>
</document>
from a self-closing empty element?
<document>
<empty/>
</document>
Let's suppose we parse both of the above xml fragments and print the eventType and the event itself, just like this:
System.out.println("eventType:" + event.getEventType() + "; element:"+event.toString());
Both of the above fragments will produce the exact same result:
eventType:7; element:<?xml version="null" encoding='null' standalone='no'?>
eventType:1; element:<document>
eventType:4; element:
eventType:1; element:<empty>
eventType:2; element:</empty>
eventType:2; element:</document>
eventType:8; element:ENDDOCUMENT
Just to give some context, what we want to achieve is, we want to rewrite some parts of the xml based on some rules, but want to preserve other parts exactly as they are, that is, we want to keep empty elements in their original form, even though the two forms are semantically the same. If we have a normal empty element (1st example), we want to keep it that way, if we have a self-closing empty element, we want to write a self-closing element in the result as well. Can we achieve this goal with javax.xml.stream.XMLEventReader?

The answer is no. Similarly, you can't preserve whitespace within a tag (e.g. newlines between attribute values, or spaces around the "=" sign). These are considered to be of no interest to applications, and are therefore not reported.
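You can check this for yourself with a small sketch like the one below (the class name is invented for illustration). It collects the event type codes for both spellings so they can be compared; the inputs deliberately contain no whitespace, so there are no CHARACTERS events.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper that records the StAX event types for a document.
final class EmptyElementEvents {

    // Returns the XMLStreamConstants event type codes in the order they are reported.
    static List<Integer> eventTypes(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            List<Integer> types = new ArrayList<>();
            while (reader.hasNext()) {
                types.add(reader.nextEvent().getEventType());
            }
            return types;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read XML events", e);
        }
    }

    public static void main(String[] args) {
        List<Integer> explicit = eventTypes("<document><empty></empty></document>");
        List<Integer> selfClosing = eventTypes("<document><empty/></document>");
        System.out.println(explicit);
        System.out.println(selfClosing);
        System.out.println("same events: " + explicit.equals(selfClosing));
    }
}
```

The self-closing form is reported as a START_ELEMENT followed by an END_ELEMENT, just like the explicit form, so the event types alone carry no trace of the original spelling.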

You could test if the startevent and endevent have the same location
event.getLocation().getCharacterOffset();
From the javadoc
Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. Returns -1 if there is no offset available.
The offset is not guaranteed to be available; that depends on your setup, but it is worth trying to see whether it works in yours. (Also, it can only represent offsets up to Integer.MAX_VALUE.)

How can I remove HTML tags in Java?

I need to remove the HTML tags from the following string in Java:
String text = "<html><head></head><body>hi x>a and y<b and z>c</body></html>";
I can do this with regular expressions, but that also removes the "b and z" in the string, because it is considered a tag.

Of course it will remove "b and z". It is supposed to remove that text, because in HTML attributes do not have to be quoted and they do not need values. So b is an element, and "and" and "z" are attributes (without values). That is what an HTML parser would recognize.
Of course, "and" and "z" are not really acceptable attributes for the b element, but in terms of syntactic well-formedness you should recognize the b as an element.
If you did not want that removed, you need to write your < as &lt;. That is how to write correct HTML anyway. :)
ADDENDUM
(Yes I am aware of the famous "can't parse HTML with a regex" answer cited above in the comment, but the < vs &lt; in the question was worth pointing out in an answer, IMHO.)
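To make the difference concrete, here is a small sketch. The regex is a guess at the kind of pattern the question is using, and the class name is invented; it is a naive stripper, not an HTML parser.

```java
// Hypothetical illustration of naive tag stripping; not a real HTML parser.
final class NaiveTagStripper {

    // Removes everything from a '<' up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Decodes the three basic references; &amp; goes last so "&amp;lt;" is not decoded twice.
    static String unescapeBasicEntities(String text) {
        return text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String raw = "<html><head></head><body>hi x>a and y<b and z>c</body></html>";
        String escaped =
                "<html><head></head><body>hi x&gt;a and y&lt;b and z&gt;c</body></html>";
        // "<b and z>" is taken for a tag and disappears.
        System.out.println(stripTags(raw));
        // With the text escaped properly, it survives stripping and can be decoded afterwards.
        System.out.println(unescapeBasicEntities(stripTags(escaped)));
    }
}
```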

Java: Ignoring escapes when parsing XML

I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).
A previous similar question, Read escaped quote as escaped quote from xml, received one answer that seems to be specific to Apache, and another that appears to simply not do what it says it does. I'd love to be proven wrong on either count, however :)
For reference, here is some code:
file = new File(fileName);
DocBderFac = DocumentBuilderFactory.newInstance();
DocBder = DocBderFac.newDocumentBuilder();
doc = DocBder.parse(file);
NodeList textElmntLst = doc.getElementsByTagName(text);
Element textElmnt = (Element) textElmntLst.item(0);
NodeList txts = textElmnt.getChildNodes();
String txt = ((Node) txts.item(0)).getNodeValue();
System.out.println(txt);
I would like that println() to produce things like
&quot;3&gt;2&quot;
instead of
"3>2"
which is what currently happens.
Thanks!

You can turn them back into xml-encoded form by
StringEscapeUtils.escapeXml(str);
(javadoc, commons-lang)
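If pulling in commons-lang is too much, a library-free version is short. This is only a sketch of the idea (the class and method names are my own, and it is not the commons-lang implementation): it escapes the five predefined XML characters.

```java
// Hypothetical stand-in for a library escaper; covers only the five predefined entities.
final class XmlTextEscaper {

    // Re-encodes &, <, >, " and ' as their predefined entity references.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("\"3>2\""));
    }
}
```

Keep in mind that re-escaping cannot recover how the file originally spelled things: a quote written as &#34;, or a literal > that was never escaped, both come back in the entity form.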

I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).
Bad requirement. Don't do that.
Or at least consider carefully why you think you want or need it.
CDATA sections and escapes are a tactic for allowing you to pass text like quotes and '<' characters through XML and not have XML confuse them with markup. They have no meaning in themselves and when you pull them out of the XML, you should accept them as the quotes and '<' characters they were intended to represent.

One approach might be to try dom4j, and to use the Node.asXML() method. It might return a deep structure, so it might need cloning to get just the node or text you want without any of its children.

Both good answers, but both a little too heavy-weight for this very small-scale application. I ended up going with the total hack of just stripping out all &s (I do this to &s that aren't part of escapes later anyway). It's ugly, but it's working.
Edit: I understand there's all kinds of things wrong with this, and that the requirement is stupid. It's for a school project, all that matters is that it work in one case, and the requirement is not my fault :)

Use RegExp to replace XML tags with whitespaces (in the length of the tags)

I need to strip all xml tags from an xml document, but keep the space the tags occupy, so that the textual content stays at the same offsets as in the xml. This needs to be done in Java, and I thought RegExp would be the way to go, but I have found no simple way to get the length of the tags that match my regular expression.
Basically what I want is this:
Pattern p = Pattern.compile("<[^>]+>[^<]*</[^>]+>");
Matcher m = p.matcher(stringWithXMLContent);
String strippedContent = m.replaceAll("THIS IS A STRING OF WHITESPACES IN THE LENGTH OF THE MATCHED TAG");
Hope somebody can help me to do this in a simple way!

Since < and > characters always surround starting and ending tags in XML, this may be simpler with a straightforward state machine. Simply loop over all characters (in some writeable form - not stored in a string), and if you encounter a < flip on the "replacement mode" and start replacing all characters with spaces until you encounter a >. (Be sure to replace both the initial < and the closing >).
If you care about layout, you may wish to avoid replacing tab characters and/or newline characters. If all you care about is overall string length, that obviously won't matter.
Edit: If you want to support comments, processing instructions and/or CDATA sections, you'll need to explicitly recognize these too; also, attribute values unfortunately can include > as well; all this means a full-fledged implementation will be more complex than you'd like.
A regular transducer would be perfect for this task; but unfortunately those aren't exactly commonly found in class libraries...
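A minimal sketch of the simple state machine, under the assumptions already stated (no comments, processing instructions, CDATA sections, or > inside attribute values). The class and method names are invented for illustration.

```java
// Hypothetical helper: blanks out tags while keeping every character offset unchanged.
final class TagBlanker {

    static String blankTags(String xml) {
        char[] chars = xml.toCharArray();
        boolean insideTag = false;
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c == '<') {
                insideTag = true;           // start of a tag: switch replacement mode on
            }
            if (insideTag) {
                if (c == '>') {
                    insideTag = false;      // end of the tag; the '>' itself is still blanked
                }
                // Keep tabs and line breaks so the layout is preserved.
                if (c != '\n' && c != '\r' && c != '\t') {
                    chars[i] = ' ';
                }
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String xml = "<p id=\"1\">text</p>";
        System.out.println("[" + blankTags(xml) + "]");
    }
}
```

Working on a char array keeps the result the same length as the input, so the text content stays at the offsets it had in the XML.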

Pattern p = Pattern.compile("<[^>]+>[^<]*</[^>]+>");
In the spirit of You Can't Parse XML With Regexp, you do know that's not an adequate pattern for arbitrary XML, right? (It's perfectly valid to have a > character in an attribute value, for example, not to mention other non-tag constructs.)
I have found no simple way to get the length of the tags that match my regular expression.
Instead of using replaceAll, repeatedly call find on the Matcher. You can then read start/end to get the indexes to replace, or use the appendReplacement method on a buffer. eg.
StringBuffer b= new StringBuffer();
while (m.find()) {
String spaces= StringUtils.repeat(" ", m.end()-m.start());
m.appendReplacement(b, spaces);
}
m.appendTail(b);
stringWithXMLContent= b.toString();
(StringUtils comes from Apache Commons. For more background and library-free alternatives see this question.)

Why not use an XML pull parser and simply echo everything that you want to keep as you encounter it, e.g. character content? Whenever you reach a start or end tag, work out its length from the name of the element plus any attributes it has, and write the appropriate number of spaces.
The SAX API also has callbacks for ignorable whitespace, so you can also echo all whitespace that occurs in your document.

Maybe m.start() and m.end() can help.
m.start() => "The index of the first character matched"
m.end() => "The offset after the last character matched"
(m.end() - m.start())-2 and you know how many /s you need.

string.replaceAll("(</?[a-zA-Z]{1}>)*", "")
You can also try this. It searches for a <, then zero or one occurrence of /, then exactly one letter (lowercase or uppercase), then a >; the trailing * allows multiple occurrences of this pattern.
:)
