I have a program that reads large XML files containing multiple entries of data. The database I'm using it for originally consisted of 40,000 separate entries, each written as its own XML file, but you can download a single XML file that contains all the entries. Because of this, the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
appears multiple times throughout the document, and I was wondering whether there is some way of dealing with this using a StAX parser.
Edit: I should have said that I can't parse my document and read everything, because it keeps returning the error:
Exception in thread "main" javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1062,6]
Message: The processing instruction target matching "[xX][mM][lL]" is not allowed.
because the XML declaration appears multiple times.
Thanks
Until you eliminate the spurious <?xml ?> declaration(s), you cannot treat the file as XML because it is not well-formed. First treat it as text, either manually or programmatically, to eliminate the extra XML declarations before trying to parse it as XML.
For general information on all the ways the error
The processing instruction target matching "[xX][mM][lL]" is not allowed.
arises, and remedies for each, see this answer (as suggested by Stefan).
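As a rough illustration of the "treat it as text first" step, here is a minimal Java sketch. It assumes the whole input fits in memory as a String (a real 40,000-entry file would more likely be filtered as a stream), and it assumes the entries may end up as sibling root elements once the declarations are gone, so it also wraps everything in a hypothetical <entries> root. The class and method names are made up for the example.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StripXmlDeclarations {
    // Matches "<?xml" followed by whitespace, so "<?xml-stylesheet ...?>" is left alone.
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml\\s[^?]*\\?>");

    // Text-level cleanup: drop every XML declaration and add a single synthetic root.
    // The "entries" root name is an assumption made for this sketch.
    static String toWellFormed(String raw) {
        String body = XML_DECL.matcher(raw).replaceAll("");
        return "<entries>" + body + "</entries>";
    }

    // Ordinary StAX parsing, which only works once the text is well-formed.
    static int countElements(String xml, String name) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(name)) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<entry>a</entry>\n"
                   + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<entry>b</entry>\n";
        System.out.println(countElements(toWellFormed(raw), "entry")); // prints 2
    }
}
```

Dropping the declarations this way discards their encoding information, so this only makes sense when every entry uses the same encoding and the text was decoded with it.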
This line is the XML declaration, which is part of the XML prolog:
<?xml version="1.0" encoding="UTF-8"?>
The XML prolog is optional. If it exists, it must come first in the document.
It must not be repeated anywhere else in the document.
Source: XML Prolog, W3Schools
Related
I want to update the XML but preserve the original processing instruction; most of the time it's just:
<?xml version="1.0" encoding="UTF-8"?>
However, I can't find a way to extract the line from com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.JAXPSAXParser (or any other XML reader), or to carry it over automatically when writing. Is there any way other than manually reading the line, keeping it, and writing it first before flushing the new XML?
Its proper name is an XML declaration; it looks like a processing instruction but technically it isn't one.
Parsing invariably involves decoding the file (that is, converting the octets into characters); once that has been done, the theory goes, the application doesn't need to know how they were originally encoded. Similarly, when serializing the file, the application has to tell the serializer what encoding to use, and the serializer then takes responsibility for writing an XML declaration that reflects that encoding.
Allowing the application control over the XML declaration would break proper architectural layering, and would create the possibility of writing an XML declaration that is wrong. This bit of the content belongs to the parser layer, not to the application layer.
Of course in practice it's possible to get an XML declaration that doesn't match the actual encoding anyway, because there's nothing to stop you writing an XML declaration using software that knows nothing about XML. People do that, and they create broken content, and then they ask us on StackOverflow how to fix it. I'm not going to encourage you down that route.
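To make the layering concrete, here is a small sketch using the standard JAXP Transformer as the serializer: the application only names an encoding, and the serializer writes a matching declaration. The class and method names are invented for the example. Reading the declared encoding back with Document.getXmlEncoding() and falling back to UTF-8 is my own assumption about what "preserving" the declaration would mean in practice.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DeclarationOnWrite {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // The application picks the encoding; the serializer encodes the bytes
    // and writes an XML declaration that agrees with them.
    static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, encoding);
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>caf\u00e9</root>");
        // Reuse the encoding named in the input, if the parser recorded one (assumption).
        String encoding = doc.getXmlEncoding() != null ? doc.getXmlEncoding() : "UTF-8";
        System.out.println(new String(serialize(doc, encoding), encoding));
    }
}
```

The exact text of the emitted declaration (for example whether a standalone attribute appears) depends on the serializer implementation, which is the point: it is the serializer's job, not the application's.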
I have an XML file which references an associated XSL file, like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="my-transform.xsl"?>
<my-root> .....
and I want to read it in as a org.w3c.dom.Document, applying the transform.
I'm considering reading it in, extracting the stylesheet processing-instruction using XPATH /processing-instruction('xml-stylesheet') and then loading the XSL file by hand and applying it with a Transformer.
But it seems odd that I need to do this manually - is there a neat way to read the file and apply the embedded transform automatically?
UPDATE: thanks to #raphaëλ for observing that TransformerFactory.getAssociatedStylesheet(...) will identify the xml-stylesheet value as a Source, which is pretty close. Is there anything more automatic than that?
Ok, nobody else answered, and I know the answer now. Stylesheets are not applied automatically. But you can get hold of the stylesheet using TransformerFactory.getAssociatedStylesheet(...), which will identify the xml-stylesheet value as a Source. You can then apply it manually.
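For anyone landing here, a sketch of what "apply it manually" can look like. The class and method names are invented; the only API relied on is TransformerFactory.getAssociatedStylesheet(Source, String, String, String). One caveat I'm fairly sure of but have not checked on every implementation: some processors only recognise the processing instruction when it carries a type such as text/xsl, so a bare href like the one in the question may yield no stylesheet.

```java
import java.io.File;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

class ApplyEmbeddedStylesheet {
    static Document transformWithAssociatedStylesheet(File xmlFile) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();

        // Scan the prolog for an xml-stylesheet PI; null media/title/charset means "no filter".
        Source stylesheet = factory.getAssociatedStylesheet(
                new StreamSource(xmlFile), null, null, null);
        if (stylesheet == null) {
            throw new IllegalStateException("No usable xml-stylesheet PI in " + xmlFile);
        }

        // Apply the stylesheet by hand and collect the result as a DOM tree.
        Transformer transformer = factory.newTransformer(stylesheet);
        DOMResult result = new DOMResult();
        transformer.transform(new StreamSource(xmlFile), result);
        return (Document) result.getNode();
    }
}
```

A relative href is resolved against the system ID of the source, which is why the sketch builds the StreamSource from a File rather than from a bare stream.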
Thanks to raphaëλ for pointing this out.
I am trying to parse an XML document using a SAX parser but keep getting "XML document structures must start and end within the same entity." This is expected, as the XML document I get from another source won't be a proper one. But I don't want this exception to be raised: I would like to parse the document until I find <myTag>, and I don't care whether the document has proper opening and closing tags.
Example:
<employeeDetails>
<firstName>xyz</firsName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>...
Here I don't care whether the document is valid, as that part is out of my hands. I would like to parse this document until I see <departName> and stop parsing after that. Please suggest how to do this. Thanks.
You cannot use an XML parser to parse a file that does not contain well-formed XML. (It does not have to be valid, just well-formed. For the difference, read Well-formed vs Valid XML.)
By definition, XML must be well-formed, otherwise it is not XML. Parsers in general have to have some fundamental constraints met in order to operate, and for XML parsers, it is well-formedness.
Either repair the file manually first to be well-formed XML, or open it programmatically and parse it as a text file using traditional parsing techniques. An XML parser cannot help you unless you have well-formed XML.
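If you go the "parse it as text" route for a case like the one in the question, a regular expression is often enough. The following is only a sketch under that assumption: the class and method names are hypothetical, and it does not decode entities or cope with CDATA sections, comments, or nested elements of the same name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstTagAsText {
    // Returns the raw text between the first <tag ...> and the next </tag>, if any.
    static Optional<String> firstTagText(String raw, String tag) {
        String name = Pattern.quote(tag);
        Pattern pattern = Pattern.compile(
                "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(raw);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Truncated, non-well-formed input like the one in the question.
        String raw = "<employeeDetails>\n<firstName>xyz</firsName>\n<lastName>orp</lastName>\n"
                   + "<departmentDetails>\n<departName>SALES</departName>\n<departCode>982</departCode>";
        System.out.println(firstTagText(raw, "departName").orElse("(not found)")); // prints SALES
    }
}
```

Because this never builds a tree, the mismatched and unclosed tags elsewhere in the input do not matter.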
BeautifulSoup in Python can handle incomplete XML really well.
I use it to parse the prefix of large XML files for previews.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a><b>foo</b><b>bar<','xml')
<?xml version="1.0" encoding="unicode-escape"?>\n<a><b>foo</b><b>bar</b></a>
I have a situation where XML is sent from a third-party server. It's not really XML, but a custom tag-based data format; because it comes from a third party, I can't change the format, and coordinating with them is pretty difficult. The markup looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<result>SUCCESS</result>
<req>
<?xml version="1.0" encoding="UTF-8"?>
<Secure>
<Message id="dfgdfdkjfghldkjfgh88934589345">
<VEReq>
<version>1.0.2</version><pan>3453243453453</pan>
<Merchant><acqBIN>433274</acqBIN>
<merID>3453453245</merID>
<password>342534534</password>
</Merchant>
<Browser></Browser>
</VEReq>
</Message>
</Secure>
</req>
<id>1906547421350020</id>
<trackid>f68fb35c-cbc2-468b-aaf8-7b3f399b709d</trackid>
<ci>6</ci>
Here I want only the values of the result, req, id, trackid and ci tags as the parse output. That is, after parsing I need req to contain all the content inside its tags. One more point: the req tag embeds another XML document as-is, not as CDATA. I can't parse it using JAXB.
Does anybody know of a library that can parse all the content if I configure the available tags in a file, or any other way to do it? I don't really need to convert them to an object; even a HashMap with the tag as key and the content as value would be fine. But I would prefer the POJO model (generating a class from this kind of XML).
Let me know if somebody can help.
Make it well-formed XML first and then pass it to whatever tool you find suitable. JAXB is not bad, as it will ignore elements it does not know (apart from the root element).
And since most (if not all) tools expect well-formed XML anyway, you'll have to take care of turning your "false" XML into "true" XML first. I'd first try something like JTidy or JSoup and see if they help to make your non-well-formed XML well-formed.
If it does not work I'd try to hack it on the lower-level SAX or StAX parsing. The XML you posted seems to suffer from two problems: no single root element and XML declaration in the body. I think both problems can be addressed with some minimal parser hacking.
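Both problems can also be fixed at the text level before any parser sees the data. The sketch below is one way to do it, assuming the payload fits in a String: it strips every XML declaration, wraps the rest in a hypothetical <wrapper> root, parses with DOM, and returns a map from each top-level tag to its content (nested markup for req, plain text for the others). All names are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WrapAndParse {
    // "<?xml" followed by whitespace: matches declarations, not other PIs.
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml\\s[^?]*\\?>");

    static Map<String, String> topLevelFields(String raw) throws Exception {
        // Fix both problems as text: no declarations in the body, one root element.
        String wellFormed = "<wrapper>" + XML_DECL.matcher(raw).replaceAll("") + "</wrapper>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormed)));

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        Map<String, String> fields = new LinkedHashMap<>();
        for (Node field = doc.getDocumentElement().getFirstChild();
                field != null; field = field.getNextSibling()) {
            if (field.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between top-level tags
            }
            // Serialize nested elements (the embedded document inside <req>) back to markup.
            StringWriter nested = new StringWriter();
            for (Node child = field.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    serializer.transform(new DOMSource(child), new StreamResult(nested));
                }
            }
            String markup = nested.toString();
            fields.put(field.getNodeName(),
                    markup.isEmpty() ? field.getTextContent().trim() : markup);
        }
        return fields;
    }
}
```

From the map it is a short step to a POJO, or the well-formed string can be handed to JAXB with the wrapper as root element.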
And I think there is a special place in hell for people who invent this type of non-well-formed XML. Damned to sit there and correct all the HTML documents on the Internet into valid XHTML by hand.
I was learning how to convert an XML file into HTML using just Java; later I decided to learn how to use the XSLT language to do the same.
By "just Java" I mean using only the syntax of the Java language, not the XSLT language.
To clarify:
Loading XML into a DOM (using a DocumentBuilder).
Parsing it (just doing things like doc.getFirstChild()).
Writing it to an HTML file (just using a character stream, not an XML serialization).
What happened?
After I included the following line in my XML:
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
My Java application couldn't write the HTML right...
If I remove that, everything is right, but I want to keep it.
Any ideas how to ignore this "instruction"?
XSLT will ignore processing instructions (that is, remove them) by default. If you want to retain this one, just add a template rule to do so:
<xsl:template match="processing-instruction('xml-stylesheet')">
<xsl:copy/>
</xsl:template>
This assumes that your stylesheet is written in the classic recursive-descent style using apply-templates; if you're self-taught in XSLT then you might not have yet learnt this style. As always, it's much easier to help people when they show us the code.
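Since no code was shown, here is a guess at a minimal setup that exercises the rule from Java. The stylesheet (with invented templates for items and item), the class name and the method name are all assumptions for illustration; only the processing-instruction template comes from the answer above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class KeepStylesheetPi {
    // Recursive-descent stylesheet: the built-in rule for the document node applies
    // templates to its children, which include the xml-stylesheet PI.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match=\"processing-instruction('xml-stylesheet')\">"
        + "<xsl:copy/>"
        + "</xsl:template>"
        + "<xsl:template match='items'><ul><xsl:apply-templates select='item'/></ul></xsl:template>"
        + "<xsl:template match='item'><li><xsl:value-of select='.'/></li></xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                   + "<?xml-stylesheet type=\"text/xsl\" href=\"mystylesheet.xsl\"?>"
                   + "<items><item>a</item></items>";
        System.out.println(transform(xml));
    }
}
```

Without the processing-instruction template, the built-in rule for processing instructions produces nothing, so the instruction would be missing from the output.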
It depends on how you are reading the XML from your Java application. But if your XML has an embedded Processing Instruction like
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
then it means that the stylesheet is an integral part of the data, and must be applied to the XML for it to be of any use. This is very similar to a CSS stylesheet processing instruction like, for example
<?xml-stylesheet type="text/css" href="standard.css"?>
which, in the same way, is an integral part of the XHTML, just as if it was an internal style within <style> tags.
It is clearly possible to read and use the XML without applying the stylesheet, but that is to ignore the directive of the data itself.
If you want to treat the XML as raw data and apply an optional transform to it in different ways then you must omit the processing instruction from the XML.
Sorry guys, I thought that the XML with the stylesheet.xsl was being "transformed" in the DOM object that I was using to parse the XML.
I made assumptions that:
The XML was being transformed before being put in the DOM.
The <?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?> was invisible in the DOM.
Basically I had a simple XML to start learning how to do the transformation. Something like the following:
<?xml version="1.0" encoding="UTF-8"?>
<items><item>...</item></items>
For simplicity (I was learning...) I decided to start my parsing with:
parse(doc.getFirstChild().getFirstChild()); //Expecting the first "item".
But after introducing the stylesheet to the XML the document became:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
<items><item>...</item></items>
Because of this addition, doc.getFirstChild().getFirstChild() was no longer an "item".
Then I realized that I had forgotten to skip the node holding this instruction (I really thought it was "invisible" in the DOM tree).
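For the record, a small sketch of what I should have written. The helper name firstChildElement is my own; getDocumentElement() is the standard DOM way to reach the root element regardless of what precedes it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FirstItem {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Skips text, comments and processing instructions; returns null if there is no element child.
    static Element firstChildElement(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<?xml-stylesheet type=\"text/xsl\" href=\"mystylesheet.xsl\"?>"
                + "<items><item>...</item></items>");
        // The first child of the document is now the processing instruction, not <items>.
        System.out.println(doc.getFirstChild().getNodeName());                         // xml-stylesheet
        // getDocumentElement() always returns the root element.
        System.out.println(firstChildElement(doc.getDocumentElement()).getNodeName()); // item
    }
}
```

The XML declaration itself never shows up as a node; only the xml-stylesheet instruction does.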
Learning guys, learning...
P.S. That was my first attempt to transform an XML document with XSLT!
Thank you for your help.