Parsing a false XML using JAXB - java

I have a situation where XML (not really XML data, but a tag-based custom data format) is sent from a third-party server. Because of that, I can't change the format, and coordinating with the third party is pretty difficult. The markup looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<result>SUCCESS</result>
<req>
<?xml version="1.0" encoding="UTF-8"?>
<Secure>
<Message id="dfgdfdkjfghldkjfgh88934589345">
<VEReq>
<version>1.0.2</version><pan>3453243453453</pan>
<Merchant><acqBIN>433274</acqBIN>
<merID>3453453245</merID>
<password>342534534</password>
</Merchant>
<Browser></Browser>
</VEReq>
</Message>
</Secure>
</req>
<id>1906547421350020</id>
<trackid>f68fb35c-cbc2-468b-aaf8-7b3f399b709d</trackid>
<ci>6</ci>
Here I want only the values of the result, req, id, trackid and ci tags as the parse output. That is, after parsing I need req to contain all the content inside its tags. One more point: the req tag embeds another XML document as-is, not as CDATA. I can't parse it using JAXB.
Does anybody know a library that can parse all the content if I configure the available tags in a file, or any other way? I don't strictly need to convert them to an object; even a HashMap with the tag as key and the content as value is fine. But I prefer the POJO model (generating a class from this kind of XML).
Let me know if somebody can help me.

Make it well-formed XML first and then pass it to whatever tool you find suitable. JAXB is not bad, as it will ignore elements it does not know (apart from the root element).
And since most (if not all) tools expect well-formed XML anyway, you'll have to take care of turning your "false" XML into "true" XML first. I'd first try something like JTidy or JSoup and see if they help to make your non-well-formed XML well-formed.
If that does not work, I'd try to hack it at the lower level with SAX or StAX parsing. The XML you posted seems to suffer from two problems: no single root element and an XML declaration in the body. I think both problems can be addressed with some minimal parser hacking.
And I think there is a special place in hell for people who invent this type of non-well-formed XML. Damned to sit there and correct all the HTML documents on the Internet into valid XHTML by hand.
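A minimal sketch of the "make it well-formed first" idea, assuming the only two problems are the ones named above (no single root element, stray XML declarations in the body). The class name FalseXmlRepair, the synthetic wrapper element and the map-based result are illustrative choices of mine, not part of any library. It does the repair on the raw string rather than inside a SAX or StAX parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper: repairs the "false" XML, then reads it with the standard DOM API.
final class FalseXmlRepair {

    // Drop every XML declaration and wrap the rest in one synthetic root element.
    static String makeWellFormed(String raw) {
        String body = raw.replaceAll("<\\?xml\\s[^?]*\\?>", "");
        return "<wrapper>" + body + "</wrapper>";
    }

    // Map each top-level tag name to its inner content (nested markup is kept as markup).
    static Map<String, String> topLevelValues(String raw) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(makeWellFormed(raw))));
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        Map<String, String> values = new LinkedHashMap<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between the top-level tags
            }
            StringWriter inner = new StringWriter();
            for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    // Serialize nested elements back to markup.
                    serializer.transform(new DOMSource(c), new StreamResult(inner));
                } else if (c instanceof Text) {
                    inner.write(c.getNodeValue());
                }
            }
            values.put(n.getNodeName(), inner.toString().trim());
        }
        return values;
    }
}
```

Calling FalseXmlRepair.topLevelValues(response) on the posted markup should give a map with the keys result, req, id, trackid and ci, where req holds the embedded Secure document as a string. From there the repaired string could equally be handed to JAXB with a class mapped to the wrapper element.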

Related

StAX does not read characters like "“" [duplicate]

I am trying to parse an XML document using a SAX parser but keep getting "XML document structures must start and end within the same entity.", which is expected, as the XML document I get from another source won't be a proper one. But I don't want this exception to be raised, as I would like to parse the document only until I find <myTag> in it, and I don't care whether the document has proper opening and closing tags.
Example:
<employeeDetails>
<firstName>xyz</firsName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>...
Here I don't want to care whether the document is valid or not, as this part is out of my hands. So I would like to parse this document until I see <departName> and stop parsing after that. Please suggest how to do this. Thanks.
You cannot use an XML parser to parse a file that does not contain well-formed XML. (It does not have to be valid, just well-formed. For the difference, read Well-formed vs Valid XML.)
By definition, XML must be well-formed, otherwise it is not XML. Parsers in general have to have some fundamental constraints met in order to operate, and for XML parsers, it is well-formedness.
Either repair the file manually first to be well-formed XML, or open it programmatically and parse it as a text file using traditional parsing techniques. An XML parser cannot help you unless you have well-formed XML.
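As a rough illustration of the "parse it as a text file" route, here is a small sketch that scans for the first occurrence of one tag and ignores the rest of the document. TagScanner and firstValue are names made up for this example. It assumes the tag has no attributes and holds plain text only, and it does not decode entity references.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: plain text scanning, no XML parser involved.
final class TagScanner {

    // Return the text of the first <tag>...</tag> pair found, ignoring everything else.
    static Optional<String> firstValue(String text, String tag) {
        String quoted = Pattern.quote(tag);
        Pattern p = Pattern.compile("<" + quoted + ">([^<]*)</" + quoted + ">");
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }
}
```

With the truncated employeeDetails sample from the question, TagScanner.firstValue(doc, "departName") finds SALES even though the document never closes, while the mismatched firstName pair simply yields an empty result instead of an exception.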
BeautifulSoup in Python can handle incomplete XML really well.
I use it to parse the prefix of large XML files for preview.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a><b>foo</b><b>bar<','xml')
<?xml version="1.0" encoding="unicode-escape"?>\n<a><b>foo</b><b>bar</b></a>

Transforming XML with Java

I was learning how to convert an XML file into HTML using just Java; later I decided to learn how to use the XSLT language to do the same.
By "just Java" I mean using only the syntax of the Java language, that is, no XSLT.
To clarify:
Loading XML into a DOM (using a DocumentBuilder).
Parsing it (just doing things like doc.getFirstChild()).
Writing it to a HTML file (just using a character stream, not a XML serialization).
What happened?
After I include the following line in my XML:
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
My Java application couldn't write the HTML right...
If I remove that, everything is right, but I want to keep it.
Any ideas how to ignore this "instruction"?
XSLT will ignore processing instructions (that is, remove them) by default. If you want to retain this one, just add a template rule to do so:
<xsl:template match="processing-instruction('xml-stylesheet')">
<xsl:copy/>
</xsl:template>
This assumes that your stylesheet is written in the classic recursive-descent style using apply-templates; if you're self-taught in XSLT then you might not have yet learnt this style. As always, it's much easier to help people when they show us the code.
It depends on how you are reading the XML from your Java application. But if your XML has an embedded Processing Instruction like
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
then it means that the stylesheet is an integral part of the data, and must be applied to the XML for it to be of any use. This is very similar to a CSS stylesheet processing instruction like, for example
<?xml-stylesheet type="text/css" href="standard.css"?>
which, in the same way, is an integral part of the XHTML, just as if it was an internal style within <style> tags.
It is clearly possible to read and use the XML without applying the stylesheet, but that is to ignore the directive of the data itself.
If you want to treat the XML as raw data and apply an optional transform to it in different ways then you must omit the processing instruction from the XML.
Sorry guys, I thought that the XML with the stylesheet.xsl was being "transformed" in the DOM object that I was using to parse the XML.
I made assumptions that:
The XML was being transformed before being put in the DOM.
The <?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?> was invisible in the DOM.
Basically I had a simple XML to start learning how to do the transformation. Something like the following:
<?xml version="1.0" encoding="UTF-8"?>
<items><item>...</item></items>
For simplicity (I was learning...) I decided to start my parsing with:
parse(doc.getFirstChild().getFirstChild()); //Expecting the first "item".
But after introducing the stylesheet to the XML the document became:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
<items><item>...</item></items>
Because of this addition, doc.getFirstChild().getFirstChild() was no longer an "item".
Then I realized that I had forgotten to skip the node holding this instruction (I really thought that it was "invisible" in the DOM tree).
Learning guys, learning...
P.S. That was my first attempt to transform an XML document with XSLT!
Thank you for your help.
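For anyone hitting the same thing, a small sketch of one way to avoid depending on the position of the processing instruction: start from doc.getDocumentElement() instead of doc.getFirstChild(), and skip non-element nodes. FirstItemFinder and its method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper showing how to reach the first "item" regardless of leading PIs.
final class FirstItemFinder {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // First element child of the root element, skipping text, comments and PIs.
    static Element firstItem(Document doc) {
        // getDocumentElement() always returns the root element, never a processing instruction.
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null; // the root has no element children
    }
}
```

With the stylesheet line present, doc.getFirstChild() is a node of type Node.PROCESSING_INSTRUCTION_NODE, whereas firstItem(doc) still lands on the item element.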

How to generate XSD from elements of XML

I have an XML input:
<field>
<name>id</name>
<dataType>string</dataType>
<maxlength>42</maxlength>
<required>false</required>
</field>
I am looking for a library or a tool which will take an XML instance document and output a corresponding XSD schema.
I am looking for some java library with which I can generate a XSD for the above XML structure
If all you want is an XSD so that the XML you gave conforms to it, you'd be much better off by crafting it yourself rather than using a tool.
No one knows better than you the particularities of the schema, such as which valid values are there (for instance, is the <maxlength> element required? are true and false the only valid values for <required>?).
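To make the hand-crafted route concrete, here is a sketch of a small XSD for the posted field element, checked from Java with the standard javax.xml.validation API. The schema decisions are my assumptions, not something the question states: maxlength is treated as an optional positive integer and required as xs:boolean. FieldSchemaCheck is a made-up class name.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper: validates a <field> document against a hand-written schema.
final class FieldSchemaCheck {

    // Hand-written schema; the types and minOccurs values are assumptions for illustration.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='field'><xs:complexType><xs:sequence>"
          + "<xs:element name='name' type='xs:string'/>"
          + "<xs:element name='dataType' type='xs:string'/>"
          + "<xs:element name='maxlength' type='xs:positiveInteger' minOccurs='0'/>"
          + "<xs:element name='required' type='xs:boolean'/>"
          + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // True if the document conforms to XSD, false if validation reports an error.
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

The sample from the question passes this check, while a document with <required>maybe</required> is rejected. A tool looking at the single sample could not have known that either way, which is the point made above.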
If you really want to use a tool (I'd only advise using one if you haven't designed the XML and really can't get the real XSD; or, if you designed it, double-check the generated XSD), you could try Trang. It can infer an XSD schema from a number of example XML documents.
You'll have to take into account that the XSD a tool infers for you might be incomplete or inaccurate if the XML samples aren't representative enough.
java -jar trang.jar sampleXML.xml inferredXSD.xsd
You can find a usage example of Trang here.
You can try with online tool called XMLGrid: http://xmlgrid.net/xml2xsd.html
You could write an XSLT to do something like that. But the problem is, a single document alone is not enough information to generate a schema. Are any of those elements optional? Is there anything missing from that document, that might appear in other instances? How many of a particular element can there be? Do they have to be in that order? There are loads of things that can be expressed in a schema, that are not immediately obvious from one instance of a document that conforms to that schema.
For the people who really want to include it in their Java code to generate an XSD and understand the perils, check out Generate XSD from XML programatically in Java
Try XMLBeans; it has some tools, one of which is inst2xsd. You can find specifics here:
http://xmlbeans.apache.org/docs/2.0.0/guide/tools.html
Good luck

Parse Ampersand in XML with Java's DOM XML API

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters an ampersand (&) when parsing a text node, it errors out. I am guessing that this is solvable by 1) escaping, 2) encoding, or 3) using a different parser.
I am reading an XML document that I don't have any control over, so I cannot precisely identify where the ampersand appears in the document every time I read it.
The answers I have seen to similar questions advise replacing the entity when parsing the XML, but I am not sure how I would do that, since the document doesn't even parse when the parser encounters the ampersand.
Any help will be appreciated.
As noted, the XML is malformed (oops!): all occurrences of & in XML (other than the token introducing a character entity [?]) must be encoded as &amp;.
Some solutions (which are basically just as described in the post!):
Fix the XML (at source, or in hack-it-up phase), or;
Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)
For the "hack-it-up" approach, consider a separate input stream -- see Working with Filter Streams -- that executes as a filter prior to the actual DOM parser: whenever a & is encountered (that is not part of a character entity) it "fixes it" by inserting &amp; into the stream. Of course, if the XML source didn't get basic encoding correct...
Happy coding.
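A rough sketch of the "hack-it-up" step, done here on a whole string with a regular expression rather than as a true filter stream, which keeps the example short. AmpersandFixer and its method names are invented for illustration. It treats anything shaped like &name; or &#123; or &#x1F; as a reference and leaves it alone, so text such as "AT&T;" would be misread as a reference, and undeclared names like &nbsp; will still fail in the parser.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical pre-processing step applied before the DOM parser sees the input.
final class AmpersandFixer {

    // An '&' that does not start a named, decimal or hexadecimal reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String raw) {
        return BARE_AMP.matcher(raw).replaceAll("&amp;");
    }

    // Escape first, then parse with the regular DOM API.
    static Document parseLeniently(String raw) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(escapeBareAmpersands(raw))));
    }
}
```

For large inputs the same pattern could be applied inside a FilterReader, at the cost of handling references that straddle buffer boundaries.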
"I am reading an XML document that I dont have any control over".
No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.

How to match two XML strings?

I am getting some information in the form of XML.
Before using that xml I want to validate that all the information is in that XML.
For this purpose I will have a master copy of the XML, against which I will match all the incoming documents.
How can I do this?
It looks like a lot of work, but you could use XPath, depending on the size and structure of your XML. Take a look at http://www.w3schools.com/xpath/default.asp.
Also there is a really good starting tutorial here:
http://www.ibm.com/developerworks/library/x-javaxpathapi.html
And if you're willing to do validation through the use of your xsd (if there is one), look at ( XML Schema (XSD) validation tool? )
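A minimal sketch of the XPath idea using the JDK's javax.xml.xpath API: keep a list of paths that must exist (derived by hand from the master copy) and report the ones an incoming document lacks. RequiredPathsCheck is a made-up name, and the paths in the usage note are placeholders, since the question does not show the actual XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: checks that every required XPath selects at least one node.
final class RequiredPathsCheck {

    // Return the XPath expressions that select nothing in the given document.
    static List<String> missingPaths(String xml, List<String> requiredPaths) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> missing = new ArrayList<>();
        for (String path : requiredPaths) {
            // A node-set converted to boolean is true exactly when it is non-empty.
            Boolean found = (Boolean) xpath.evaluate(path, doc, XPathConstants.BOOLEAN);
            if (!found) {
                missing.add(path);
            }
        }
        return missing;
    }
}
```

For example, checking a document against the placeholder paths /order/id, /order/customer/name and /order/total returns only the paths that are absent; an empty list means all required information is there. This only checks presence, not values or ordering, which is where a schema becomes the better tool.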
A common approach for validating XML is to define a schema (XSD or DTD). A parser can validate the input XML document against all constraints that are specified in the schema document.
This is a common way to make sure that certain elements are present and that certain values are within a specified range.
You can refer to the following link to see how XML can be validated against a DTD in Java:
http://www.roseindia.net/xml/dom/DOMValidateDTD.shtml
