Java: Ignoring escapes when parsing XML

I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).
A previous similar question, Read escaped quote as escaped quote from xml, received one answer that seems to be specific to Apache, and another that appears to simply not do what it says it does. I'd love to be proven wrong on either count, however :)
For reference, here is some code:
// fileName and text are defined elsewhere in the program
File file = new File(fileName);
DocumentBuilderFactory DocBderFac = DocumentBuilderFactory.newInstance();
DocumentBuilder DocBder = DocBderFac.newDocumentBuilder();
Document doc = DocBder.parse(file);
NodeList textElmntLst = doc.getElementsByTagName(text);
Element textElmnt = (Element) textElmntLst.item(0);
NodeList txts = textElmnt.getChildNodes();
String txt = txts.item(0).getNodeValue();
System.out.println(txt);
I would like that println() to produce things like
&quot;3&gt;2&quot;
instead of
"3>2"
which is what currently happens.
Thanks!

You can turn them back into xml-encoded form by
StringEscapeUtils.escapeXml(str);
(javadoc, commons-lang)
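
If adding commons-lang for a single call feels heavy, the same idea can be sketched by hand. This is a minimal sketch, not the library's implementation: the class and method names (XmlReEscape, escape) are made up for illustration, and it covers only the five predefined XML entities. Like the library call, it escapes every such character, so it cannot tell which ones were actually escaped in the source file.

```java
public class XmlReEscape {

    /** Re-encodes the five characters that have predefined XML entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The parser returns the decoded text "3>2" (including the quotes);
        // escape() turns it back into its XML-encoded form.
        System.out.println(escape("\"3>2\"")); // prints &quot;3&gt;2&quot;
    }
}
```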

I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).
Bad requirement. Don't do that.
Or at least consider carefully why you think you want or need it.
CDATA sections and escapes are a tactic for allowing you to pass text like quotes and '<' characters through XML and not have XML confuse them with markup. They have no meaning in themselves and when you pull them out of the XML, you should accept them as the quotes and '<' characters they were intended to represent.

One approach might be to try dom4j, and to use the Node.asXML() method. It might return a deep structure, so it might need cloning to get just the node or text you want without any of its children.

Both good answers, but both a little too heavy-weight for this very small-scale application. I ended up going with the total hack of just stripping out all &s (I do this to &s that aren't part of escapes later anyway). It's ugly, but it's working.
Edit: I understand there's all kinds of things wrong with this, and that the requirement is stupid. It's for a school project, all that matters is that it work in one case, and the requirement is not my fault :)

Related

In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the different metadata items are a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching&#44; evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able -- as far as I'm aware -- to differentiate the encoded comma (from the ones that aren't encoded) in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)

When you use dom4j or DOM, all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a lower-level interface and, via its LexicalHandler interface, can notify you when the parser encounters entity references, but it does not report character references. So it seems that you would really need to write your own parser, or patch an existing one.
But in the end it would be best if you can change the schema of your document:
<tags>
  <tag>Research skills</tag>
  <tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
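
With that schema, reading the tags no longer depends on how a comma was encoded. Here is a minimal sketch using the JDK's built-in DOM parser rather than dom4j (a swap made only to keep the example dependency-free); the class and method names (TagReader, readTags) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagReader {

    /** Returns the text of every <tag> element, one list entry per tag. */
    static List<String> readTags(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("tag");
            List<String> tags = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                tags.add(nodes.item(i).getTextContent().trim());
            }
            return tags;
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tags>"
                + "<tag>Research skills</tag>"
                + "<tag>Searching, evaluating and referencing</tag>"
                + "</tags>";
        // Commas inside a tag are now plain data, not delimiters.
        readTags(xml).forEach(System.out::println);
    }
}
```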

Using LexEv from http://andrewjwelch.com/lexev/, with xercesImpl.jar from Apache Xerces on the class path, I am able to compile and run a short sample using dom4j:
LexEv lexEv = new LexEv();
SAXReader reader = new SAXReader(lexEv);
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If the input1.xml has your sample XML snippet, then the output is
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.

As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
With vtd-xml, you can distinguish a character from its entity-encoded counterpart by using VTDNav's toRawString() method.

SAX Parser, new lines being removed after parsing, can I keep them in? (&#xD;)

I'm unsure on how complicated or easy this issue might be since this is my first project with XML parsing and therefore using the SAX parser.
When pulling down XML from a server, I can see in the XML there are a few &#xD; in the text, a carriage return character, or new line. After the XML has gone through the SAX parser it comes out the other end without those characters.
It comes out as a string and when I add it to a TextView it's just one big block of text, which is obviously not something I want.
Is there a way for me to parse the XML and still keep the &#xD; characters? My idea was to do a String.replace("&#xD;", "\n"); right before adding it to the TextView, therefore making use of new lines.
If there is a better way I would love to know but just being able to use &#xD; would be helpful too.
Thanks in advance for any advice you can offer.

XML parsers are required to normalize line endings, so CR and CRLF sequences are all turned into LF (CR=x0D, LF=x0A) before presentation to the application.
If you want to retain the CR characters (why??) you should represent them as numeric character references, that is, as &#xD;.
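
The difference between the two cases can be checked with a small experiment. This is a sketch using the JDK's built-in DOM parser; the class and method names (LineEndingDemo, textOf) are made up for illustration. A literal CRLF in the input is normalized to LF, while a carriage return written as the character reference &#xD; reaches the application as a real CR.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class LineEndingDemo {

    /** Parses the XML and returns the text content of the root element. */
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // "\r\n" is a literal CRLF pair in the document; &#xD; is a character reference.
        String text = textOf("<a>one\r\ntwo&#xD;three</a>");
        System.out.println(text.equals("one\ntwo\rthree")); // prints true
    }
}
```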

Before starting the parser I am converting my HTTP result into a string, so before I parse I do a String.replace("&#xD;", "&#xA;");
After parsing, new lines show up correctly. I guess the parser ignores carriage returns in favour of new lines. Which makes sense I guess.
Hopefully this helps someone down the line.

How to make SAXParser ignore escape codes

I am writing a Java program to read an XML file, actually an iTunes library, which is in XML plist format.
I have managed to get round most obstacles that this format throws up except when it encounters text containing the &. The XML file represents this ampersand as &#38; and I can only manage to read the text following the & in any particular section of text.
Is there a way to disable detection of escape codes? I am using SAXParser.

There is something fishy about what you are trying to do.
If the file format you are trying to parse contains bare ampersand (&) characters then it is not well-formed XML. Ampersands are represented as character entities (e.g. &amp;) in well-formed XML.
If it is really supposed to be real XML, then there is a bug in whatever wrote / generated the file.
If it is not supposed to be real XML (i.e. those ampersands are not a mistake), then you probably shouldn't be trying to parse it using an XML parser.
Ah, I see. The XML is actually correctly encoded, but you didn't get the SO markup right.
It would appear that your real problem is that your characters(...) callback is being called separately for the text before the &, for the (decoded) &, and finally for the text after the &. You simply have to deal with this by joining the text chunks back together.
The javadoc for ContentHandler.characters() says this:
"The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks ...".
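
Joining the chunks usually means appending to a buffer in characters() and reading it in endElement(). Below is a minimal sketch of that pattern; the class name TextCollector, the helper collect() and the sample element names are assumptions for illustration, not taken from the question's code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting fresh text for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append, never assign.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            values.add(text); // the complete text, however it was chunked
        }
        buffer.setLength(0);
    }

    /** Parses the XML and returns the non-empty text of each element that directly contains text. */
    static List<String> collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<dict><string>Tom &#38; Jerry</string></dict>"));
    }
}
```

Whether the parser delivers the text in one chunk or three, the value stored in endElement() is the whole string.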

It's probably not the best general solution for escape characters, but I only had to take into account new lines so it was easy to just check for \n.
You could check for the backslash \ only to check for all escape characters or in your case &, although I think others will come with more elegant solutions.
@Override
public void characters(char[] ch, int start, int length)
{
    String elementData = new String(ch, start, length);
    boolean elementDataContainsNewLine = (elementData.indexOf("\n") != -1);
    if (!elementDataContainsNewLine)
    {
        // do what you want if it is not a new line
    }
}

Do you have an excerpt for us? Is the file iTunes-generated? If so, it sounds like a bug in iTunes to me, which forgot to encode the ampersand correctly. I would not be surprised: they clearly didn't get XML in the first place, their schema of <name>[key]</name><string>[value]</string> must make the XML inventors puke.
You might want to use a different, more robust, parser. SAX is great as long as the file is well-formed. I do not know, however, how robust dom4j and jdom are. Just give them a try. For Python, I know that I would recommend ElementTree or BeautifulSoup, which are very robust.
Also have a look at http://code.google.com/p/xmlwise/ which I found mentioned here on Stack Overflow (did you use search?).
Update: (as per updated question) You need to understand the role of entities in XML and thus SAX. By default they are separate nodes, just like text nodes. So you will likely need to join them with adjacent text nodes to get the full value. Do you use a DTD in your parser? Using a proper DTD, with entity definitions, can help parsing a lot, as it can contain mappings from entities such as &amp; to the characters they represent (&), and the parser may be able to do the merging for you. (At least the Python XML-pull parser I like to use for large files does when materializing subtrees.)

I am parsing the below string using SAXParser
<xml>
<FirstTag>&amp;&lt;</FirstTag>
<SecondTag>test</SecondTag>
</xml>
I want the same string to be retained but it is getting converted to below
<xml>
<FirstTag>&<</FirstTag>
<SecondTag>test</SecondTag>
</xml>
Here is my code. How can I avoid this being converted?
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
MyHandler handler = new MyHandler();
values = handler.getValues();
saxParser.parse(x, handler);

javax Transformer preserve the escaping entity

I finally gave up on the XSLT transformation after several failed attempts (if you are interested, you can check my question here: don't want xslt intepret the entity characters). Now I'm working with the javax Transformer to try to solve the problem.
The problem is that I would like to keep the escaping of the apostrophe (&apos;) in the XML and HTML:
<cell i='0' j='0' vi='0' parity='Odd' subType='-1'> "String&apos;</cell>
The output is what I don't want:
<td nowrap="true" class="gridCell" i="0" j="0">"String'</td>
I would like the result as follows:
<td nowrap="true" class="gridCell" i="0" j="0">"String&apos;</td>
I don't know whether we could use the method transform to do this, and I see a similar question, but he needs the opposite thing : How Do You Prevent A javax Transformer From Escaping Whitespace?
I appreciate any help from you, thank you!

Strictly speaking, there's no difference to an XML parser between &#39;, ', &#x27; or &apos; in a text node. In an attribute it's slightly different, given that the value has to be enclosed between " or '; if you've enclosed an attribute's value in ', you MUST use an entity to specify an apostrophe within the value, and an XML serializer will do the same.
I'm a bit rusty with XML handling in Java, but I know in C# you can have your transform generate an XML document object, and create a class that inherits from the XmlWriter class to serialize it however you wish, by overriding the 'WriteString' method. Hopefully there's something similar you can do in Java, possibly by implementing the Result interface, or perhaps passing a DOMResult in that you can work with. I can't remember how you normally serialize a Document in Java, but it should be something you can manipulate or override.
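
One way to follow the DOMResult idea in Java is to write the resulting DOM tree out by hand, choosing the escaping yourself. The following is only a sketch under stated assumptions: the class AposSerializer and its methods are hypothetical, it handles just elements, attributes and text (no comments, CDATA, processing instructions or namespace handling), and for brevity it parses a string instead of running a Transformer. With a real transform, the node returned by DOMResult.getNode() could be passed to write() in the same way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AposSerializer {

    /** Escapes text content, writing apostrophes as &apos; on purpose. */
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("'", "&apos;");
    }

    /** Escapes a value that will be written between double quotes. */
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    /** Recursively writes documents, elements, attributes and text nodes. */
    static void write(Node node, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: {
                for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                    write(c, out);
                }
                break;
            }
            case Node.ELEMENT_NODE: {
                out.append('<').append(node.getNodeName());
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node a = attrs.item(i);
                    out.append(' ').append(a.getNodeName()).append("=\"")
                       .append(escapeAttr(a.getNodeValue())).append('"');
                }
                out.append('>');
                for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                    write(c, out);
                }
                out.append("</").append(node.getNodeName()).append('>');
                break;
            }
            case Node.TEXT_NODE:
                out.append(escapeText(node.getNodeValue()));
                break;
            default:
                break; // other node types are ignored in this sketch
        }
    }

    /** Parses the XML and serializes it again with the custom escaping. */
    static String serialize(String xml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            write(doc, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("<td i='0'>\"String'</td>"));
    }
}
```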

XML defines &apos; and ' to mean the same thing, so most XML tools are going to treat the distinction as irrelevant. I think you should question the requirement - why are you trying to do this? It's a bit like trying to preserve the spaces around the "=" sign in an attribute: it's pointless.

Regex exclusion behavior

Ok, so I know this question has been asked in different forms several times, but I am having trouble with specific syntax. I have a large string which contains html snippets. I need to find every link tag that does not already have a target= attribute (so that I can add one as needed).
^((?!target).)* will give me text leading up to 'target', and <a.+?>[\w\W]+?</a> will give me a link, but that's where I'm stuck. An example:
<a href="http://www.someSite.com">Link</a> (This should be a match)
<a href="http://www.someSite.com" target="...">Link</a> (this should not be a match).
Any suggestions? Using DOM or XPATH are not really options since this snippet is not well-formed html.

You are being wilfully evil by trying to parse HTML with Regexes. Don't.
That said, you are being extra evil by trying to do everything in one regexp. There is no need for that; it makes your code regex-engine-dependent, unreadable, and quite possibly slow. Instead, simply match tags and then check your first-stage hits again with the trivial regex /target=/. Of course, that character string might occur elsewhere in an HTML tag, but see (1)... you have already thrown good practice out of the window, so why not at least make things un-obfuscated so everyone can see what you're doing?

If you insist on doing it with Regex a pattern such as this should help...
<a(?![^>]*target=) [^>]*>.*?</a>
It's by no means 100% perfect. Technically speaking, a tag can contain a > in places other than the end, so it won't work for all HTML tags.
NB. I work with PHP, you may have to make slight syntax adjustments for Java.
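
For Java, the lookahead carries over essentially unchanged. Here is a minimal sketch that uses it to add the missing attribute; the class and method names (AddTarget, addTarget) and the "_blank" value are assumptions for illustration, and the pattern is narrowed to the opening tag because that is the only part being rewritten. The same caveat applies: a > inside an attribute value will confuse it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddTarget {

    // Group 1 captures an opening <a ...> tag (minus the final '>') whose
    // attribute list does not contain "target=".
    private static final Pattern LINK_WITHOUT_TARGET =
            Pattern.compile("(<a(?![^>]*target=) [^>]*)>", Pattern.CASE_INSENSITIVE);

    /** Appends target="..." to every link tag that has no target attribute yet. */
    static String addTarget(String html, String target) {
        String replacement = "$1 target=\"" + Matcher.quoteReplacement(target) + "\">";
        return LINK_WITHOUT_TARGET.matcher(html).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.someSite.com\">Link</a> "
                + "<a href=\"http://www.someSite.com\" target=\"_self\">Link</a>";
        // Only the first link is changed; the second already has a target.
        System.out.println(addTarget(html, "_blank"));
    }
}
```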

You could try a negative lookahead like this:
<a(?!.*?target.*?).*?>[\w\W]+?</a>

I didn't test this and spent about a minute writing it, but for your specific example if you can do it on the client-side, try this via the DOM:
var links = document.getElementsByTagName("a");
for (var linkIndex = 0; linkIndex < links.length; linkIndex++) {
    var link = links[linkIndex];
    if (link.href && !link.target) {
        link.target = "someTarget";
        // or link.setAttribute("target", "someTarget");
    }
}
