Let's say there were errors in an XML message:
Well-Formed
<Person><Name>Attila</Name><ID>001</ID><Age>45</Age></Person>
Not Well-Formed
<Pxxxon><Name>Attila</9327><ID>001</ID><Age>45</Age></Person>
Are there any Java libraries or code to format the non-well formed XML message to:
<Pxxxon>
<Name>Attila</9327>
<ID>001</ID>
<Age>45</Age>
</Person>
I understand that current Java libraries only format valid XML messages to this prettified format.
No, because what you list as "Invalid" is actually not well-formed.
Well-formed and valid are not the same thing.
Well-formed means that a textual object meets the W3C requirements for being XML.
Valid means that well-formed XML meets additional requirements given by a specified schema.
See Well-formed vs Valid XML for more details, but if data is not well-formed, it is not XML at all and no XML parser will be able to read it to reformat it.
You might then ask what about non-XML parsers? To which we would reply, if it's not XML, what format is it? For any parser to be able to read any data, the syntax of the data has to be defined. Simply saying that the data resembles XML insufficiently specifies the format, and that is why you'll not find a tool that can pretty-print the data sample you provided.
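To see this in practice, here is a minimal sketch using the JDK's built-in DOM parser (JAXP). The class and method names are made up for illustration. It only shows that the parser accepts the first sample and rejects the second with a fatal error; it makes no attempt to repair anything.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
class WellFormedCheck {

    // Returns true only if the text parses as well-formed XML.
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations are fatal errors and end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<Person><Name>Attila</Name><ID>001</ID><Age>45</Age></Person>";
        String bad = "<Pxxxon><Name>Attila</9327><ID>001</ID><Age>45</Age></Person>";
        System.out.println(isWellFormed(good));
        System.out.println(isWellFormed(bad));
    }
}
```

Because the parser stops at the first fatal error, there is no document tree left for a pretty-printer to work on.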
Related
I am trying to parse an XML file using a SAX parser, but when it finds an & it gives me the error "The entity name must immediately follow the '&' in the entity reference." How can I make the parser ignore '&' while parsing, or if possible convert it into &amp; from the DTD itself?
Your input is not valid XML, since it seems to contain & characters which are not followed by an entity name or character reference.
The cleanest way to solve this is to make sure that the input is valid XML before you parse it, i.e. replace the offending & characters with &amp;.
I don't think you can convince any decent XML parser to silently ignore XML syntax errors.
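If you do have to repair the input yourself before parsing, a rough sketch of that replacement might look like this. The class name and the regex are my own; it assumes the text contains no CDATA sections or comments (where a bare & is legal and must not be touched), and it leaves things like &nbsp; alone even though the parser will still complain if that entity is not declared.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper, not part of any library.
class AmpersandRepair {

    // An '&' that does NOT start a named entity (&name;), a decimal
    // character reference (&#38;) or a hex character reference (&#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[A-Za-z_:][A-Za-z0-9_:.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("<a>Tom & Jerry &amp; friends</a>"));
    }
}
```

Run the result through the SAX parser afterwards. Fixing the producer is still the better long-term answer.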
Find the person/entity responsible for producing the invalid XML input
Make sure that person/entity never again in his/her/its life will ever be capable of producing invalid XML again
Repeat for any new offender
Use of unnecessary violence in the apprehension of the XML villains HAS been approved
Or, you can just resign and use TagSoup or something similar.
I have a problem parsing an XML file which contains special characters like ", <, > or & in attributes of an element. At the moment I use XMLReader with my own ContentHandler. Unfortunately, changing the XML is not an option since I get a huge bunch of files. Any idea what I could do?
Best!
You have to change the XML in order to make it well-formed. The five magic characters must be encoded properly OR wrapped in a CDATA section to tell the parser to allow them to pass.
If the five magic characters are not encoded properly, you aren't receiving well-formed XML. That ought to be the foundation of your contract with users.
Do a one-shot change.
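On the producing side, encoding the five characters is only a few lines. This is a minimal sketch with a made-up class name; a real XML writer (for example the JDK's XMLStreamWriter) does this for you, so treat it as an illustration of the rule rather than something to ship.

```java
// Hypothetical helper showing the five predefined entities.
class XmlEscaper {

    // Replaces &, <, >, " and ' with their predefined entity references.
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String value = "bad attr chars...<\"&>...";
        System.out.println("<node attr=\"" + escape(value) + "\"/>");
    }
}
```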
It's not XML. Don't call it XML, because you are misleading yourself. You're dealing with a proprietary data syntax, and you are missing out on all the benefits of using XML for data interchange. You can't use any of the wonderful tools that exist for processing XML, because your data is not XML. You're in the dark ages of data interchange that existed before XML was invented, where everyone had to write their own parsers and port them to multiple platforms, at vast cost. It may be expensive to switch from this mess to the modern world of open standards, but the investment will pay off quickly. Just don't let any of the stakeholders delude themselves into thinking that because your syntax is "almost XML", you are almost there in terms of reaping the benefits. XML is all or nothing.
It's not best practice, but you could use regex to transform your almost-XML into proper XML before you open it with XMLReader. Something along these lines (just using javascript for a quick proof-of-concept):
var xml = '<root><node attr="bad attr chars...<"&>..."/></root>';
xml = xml.replace(/("[^"]*)&([^"]*")/, '$1&amp;$2')
xml = xml.replace(/("[^"]*)<([^"]*")/, '$1&lt;$2')
xml = xml.replace(/("[^"]*)>([^"]*")/, '$1&gt;$2')
xml = xml.replace(/("[^"]*)"([^"]*")/, '$1&quot;$2')
alert(xml);
I have written an XML parser which successfully parses an XML file given as input. But sometimes the input file given to my parser has a double quote in a text property, because of which my parser crashes.
Eg
<tag myprop=" this has a extra quote here like " some times" > </tag>
I know which tag may or may not have the extra quote. I use a DOM parser.
How can I handle this situation?
You won't be able to use an XML parser until you have actual XML. What you currently have is invalid (ie not XML). You should escape the quote-mark inside the attribute beforehand.
The escaped code would look like:
<tag myprop=" this has a extra quote here like &quot; some times" > </tag>
As to why your parser crashes, well there are dozens of XML libraries in existence - have you looked at any of those? I would personally expect to receive a ParseException or something like that.
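For what it's worth, with the JDK's own DOM parser the unescaped version fails with a SAXParseException, and the escaped version comes back with the quote intact. A small sketch (class and method names are hypothetical):

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
class QuoteInAttribute {

    // Parses the text and returns the value of the root element's
    // "myprop" attribute. Throws SAXException (normally a
    // SAXParseException) if the text is not well-formed.
    static String readMyProp(String xml) throws SAXException {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getAttribute("myprop");
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<tag myprop=\" this has a extra quote here like &quot; some times\" > </tag>";
        String raw = "<tag myprop=\" this has a extra quote here like \" some times\" > </tag>";
        try {
            System.out.println(readMyProp(escaped)); // value contains a literal "
            System.out.println(readMyProp(raw));     // never reached: parse fails
        } catch (SAXException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Catching the exception and reporting the bad file is about all the parser side can reasonably do.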
I don't know for sure, but I think it's just invalid XML and so your parser should fail gracefully (rather than crashing) but I don't think it should successfully parse such a file.
You cannot. That is not valid XML, so the DOM parser will fail to parse it.
see XML 1.0 specification, section 2.4:
http://www.w3.org/TR/xml/#attdecls
To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as "
&apos; ", and the double-quote character (") as " &quot; ".
so, since it's not valid XML your parser shouldn't be trying to handle the invalid value, it just needs to give an error.
Is there any way to use an xsd file to validate input of a string?
I have found some examples of xsd being used to validate an xml file, but what I really want is to just use one element of the xsd to validate some user input.
Is there a simple way to do this or should I just treat the xsd file as an xml file, extract the element and compare it to the given string to see if it's valid?
Thanks
If I'm understanding your question correctly, you typically use JAXB along with an XSD (schema) to validate an XML file, not the contents of a single node in an XML file. You may be better off using XPath to parse the XML file and get the contents of the particular node, and then do your comparison that way.
Here is a link to one of the JAXB tutorials and a link to an XPath tutorial.
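Another option, offered only as a sketch: wrap the user's string in the element you care about and run the JDK's javax.xml.validation validator over that tiny document. The schema below (an "age" element limited to 0..150) and the class name are invented for the example; in your case you would load your real XSD, and this only works if the element is declared globally in it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper; the schema is made up for illustration.
class XsdStringCheck {

    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='age'><xs:simpleType>"
      + "<xs:restriction base='xs:integer'>"
      + "<xs:minInclusive value='0'/><xs:maxInclusive value='150'/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "</xs:schema>";

    static Schema loadSchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("Schema itself is broken", e);
        }
    }

    // Wraps the input in <elementName>...</elementName> and validates it.
    static boolean isValid(String elementName, String userInput) {
        // Escape markup characters so the input cannot break the wrapper.
        String escaped = userInput.replace("&", "&amp;")
                                  .replace("<", "&lt;")
                                  .replace(">", "&gt;");
        String xml = "<" + elementName + ">" + escaped + "</" + elementName + ">";
        try {
            loadSchema().newValidator()
                .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the validator throws on the first violation
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("age", "45"));
        System.out.println(isValid("age", "abc"));
    }
}
```

Loading the schema once and reusing it would be the obvious improvement if you validate many inputs.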
I'm attempting to apply a stylesheet to an XML document using Saxon. Given an XML file that was generated in Microsoft Word and that has Microsoft Word-style quotes, such as around FOO in the following document
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<act>
<performer typeCode=“FOO“ />
<performer typeCode="BAR" />
</act>
</doc>
Saxon throws the following error:
SXXP0003: Error reported by XML parser: Invalid byte 1 of 1-byte UTF-8 sequence.
What is the best way to handle these types of "special" characters in XML that were intended to be valid but break in actual parsing/transformation?
Since the above is not valid XML, you will have to do some preprocessing of the input (say with a FilterReader), as just about any XML parser will indicate an error (and typically a fatal error, so you cannot handle the error and continue).
If the special quotes are only in the xml you can do a simple replace of the special quotes with plain quotes (a little more work if you have to check the preamble for the encoding type). If you want to keep special quotes elsewhere in the document you will have to do something a bit more complicated (mostly keep track as to whether you are in a tag or not).
Trouble is, those 'special' quotes are not valid XML. Saxon or any other XML parser is going to throw that stuff out and not parse the document.
Only thing I can suggest is a search and replace for those and replace them with the expected quotes.
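A bare-bones version of that search and replace, as a sketch with a made-up class name. It assumes the file has already been read into a String using its real encoding (the "Invalid byte" error suggests the bytes may not actually be UTF-8), and it replaces every curly quote, including any in text content, so it only fits the simple case described above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical pre-processing helper, not part of Saxon or the JDK.
class SmartQuoteRepair {

    // Replaces Word-style curly quotes with plain ASCII quotes.
    static String straightenQuotes(String text) {
        return text
            .replace('\u201C', '"')   // left double quotation mark
            .replace('\u201D', '"')   // right double quotation mark
            .replace('\u2018', '\'')  // left single quotation mark
            .replace('\u2019', '\''); // right single quotation mark
    }

    public static void main(String[] args) throws Exception {
        String broken = "<doc><act>"
            + "<performer typeCode=\u201CFOO\u201C />"
            + "<performer typeCode=\"BAR\" />"
            + "</act></doc>";
        String fixed = straightenQuotes(broken);
        // After the replacement the text parses like any other XML.
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(fixed)));
        System.out.println(doc.getElementsByTagName("performer").getLength());
    }
}
```

The repaired string can then be handed to Saxon as a StreamSource wrapping a StringReader instead of the original file.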