StAX does not read characters like "“" [duplicate] - java

I am trying to parse an XML document using a SAX parser but keep getting "XML document structures must start and end within the same entity." This is expected, because the XML document I get from another source won't be a proper one. But I don't want this exception to be raised: I would like to parse the document only until I find <myTag>, and I don't care whether the document has proper opening and closing tags.
Example:
<employeeDetails>
<firstName>xyz</firsName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>...
Here I don't want to care whether the document is valid, as this part is out of my hands. I would like to parse the document until I see <departName> and stop parsing after that. Please suggest how to do this. Thanks.

You cannot use an XML parser to parse a file that does not contain well-formed XML. (It does not have to be valid, just well-formed. For the difference, read Well-formed vs Valid XML.)
By definition, XML must be well-formed, otherwise it is not XML. Parsers in general have to have some fundamental constraints met in order to operate, and for XML parsers, it is well-formedness.
Either repair the file manually first to be well-formed XML, or open it programmatically and parse it as a text file using traditional parsing techniques. An XML parser cannot help you unless you have well-formed XML.
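The "parse it as a text file" option could look like the following minimal sketch. It treats the input as plain text and pulls out the first occurrence of one element with a regular expression. It assumes the wanted element carries no attributes and no nested markup; the class and method names are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstTagText {
    // Returns the text between the first <tag> and the next </tag>,
    // without requiring the rest of the input to be well-formed.
    static Optional<String> firstText(String input, String tag) {
        String quoted = Pattern.quote(tag);
        Pattern p = Pattern.compile("<" + quoted + ">(.*?)</" + quoted + ">", Pattern.DOTALL);
        Matcher m = p.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Same shape as the question's sample: mismatched and unclosed tags.
        String broken = "<employeeDetails>\n<firstName>xyz</firsName>\n<lastName>orp</lastName>\n"
                + "<departmentDetails>\n<departName>SALES</departName>\n<departCode>982</departCode>";
        System.out.println(firstText(broken, "departName").orElse("(not found)"));
    }
}
```

This only works for simple, flat elements. Anything that relies on entities, CDATA, or nesting of the same tag name needs a real repair step first.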

BeautifulSoup in Python can handle incomplete XML really well.
I use it to parse the prefix of large XML files for preview.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a><b>foo</b><b>bar<','xml')
<?xml version="1.0" encoding="unicode-escape"?>\n<a><b>foo</b><b>bar</b></a>

Related

Parse a large XML file and get the duplicate attributes

I have a large XML file. It is structured like below:
...
<LexicalEntry id="tajaAhul_$axoS_1">
<Lemma partOfSpeech="n" writtenForm="تجاهُل شخْص"/>
<Sense id="tajaAhul_$axoS_1_<homaAl_$axoS_n1AR" synset="<homaAl_$axoS_n1AR"/>
<WordForm formType="root" writtenForm="جهل"/>
</LexicalEntry>
...
The file has been created automatically, so it may contain duplicate writtenForm values. I want to parse it with Java to check whether there really are duplicate writtenForm values and, if so, to get them. With Java, the more I read about parsing XML files, the more confused I get! I found that for a large file I should use a SAX parser, but I am not familiar with all its functions and methods, and I also found that with a SAX parser I have to do all the work in some handler class.
Since you mentioned your XML is large, the best option is the SAX parser, as you have already found out. It's not as scary as you assume. It reads through your XML content and calls your "Handler" to handle what it "sees" in the XML. Your handler class is the one that will 'capture' and structure the XML content. Because the parser reads 'through' your XML, it does not need to hold the whole document in memory. There are many examples out there on SAX parsing but this could be a starter example. Good luck!
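As a rough illustration of the handler idea, here is a minimal sketch that counts every writtenForm attribute value the parser reports and returns the values seen more than once. The class name and the sample data are invented for the example, and it counts writtenForm on any element (Lemma and WordForm alike), which may or may not be what the question intends. Note that a raw < inside an attribute value, as in the Sense element of the excerpt, is not well-formed and would be rejected by a SAX parser, so the sample below avoids it.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DuplicateWrittenForms extends DefaultHandler {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Called once per opening tag; only the attribute value is kept in memory.
        String form = attrs.getValue("writtenForm");
        if (form != null) {
            counts.merge(form, 1, Integer::sum);
        }
    }

    // Values that occurred at least twice, sorted for stable output.
    Set<String> duplicates() {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    static Set<String> find(Reader reader) throws Exception {
        DuplicateWrittenForms handler = new DuplicateWrittenForms();
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(reader), handler);
        return handler.duplicates();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Lexicon>"
                + "<LexicalEntry id=\"e1\"><Lemma partOfSpeech=\"n\" writtenForm=\"alpha\"/></LexicalEntry>"
                + "<LexicalEntry id=\"e2\"><Lemma partOfSpeech=\"n\" writtenForm=\"beta\"/></LexicalEntry>"
                + "<LexicalEntry id=\"e3\"><Lemma partOfSpeech=\"v\" writtenForm=\"alpha\"/></LexicalEntry>"
                + "</Lexicon>";
        System.out.println(find(new StringReader(xml)));
    }
}
```

For a real file, a java.io.FileReader or an InputStream-based InputSource would replace the StringReader. To count only Lemma elements, the handler could check qName before reading the attribute.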

How to skip a certain pattern without reading it from a BufferedInputStream

How can I skip a certain pattern without reading it? I have
So the SAX parser throws "The processing instruction target matching "[xX][mM][lL]" is not allowed." How can I skip the next XML pattern (with mark() and reset()) and move the current position past it?
You can only use a SAX parser to read XML, and the error message you're receiving indicates that your data is not XML. Specifically, XML may only have at most one XML declaration (<?xml ... ?>), and it may only appear at the top of the file.
Before you can read this data via SAX, you'll have to repair it by fixing this problem with the XML declaration(s). You can do this manually in a text editor or programmatically by reading the file as text, not parsing it as XML via SAX because it's not XML.
For more details on how to eliminate XML declaration problems such as this, see the comprehensive answer given for this Stack Overflow question:
Error: The processing instruction target matching “[xX][mM][lL]” is not allowed
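One way to do the programmatic repair is sketched below: read the data as text, keep an XML declaration only if it is the very first thing in the text, and drop every other one before handing the result to SAX. The class and method names are hypothetical. The sketch assumes the data fits in memory as a String and that stray declarations are the only problem; if the input is really several documents concatenated, it would still have more than one root element and would still be rejected.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XmlDeclarationRepair {
    // Matches "<?xml ...?>" but not other processing instructions such as "<?xml-stylesheet ...?>".
    private static final Pattern DECL = Pattern.compile("<\\?xml\\s[^?]*\\?>");

    // Trims surrounding whitespace, keeps a declaration at offset 0, removes all later ones.
    static String keepOnlyLeadingDeclaration(String text) {
        String trimmed = text.trim();
        Matcher m = DECL.matcher(trimmed);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(trimmed, last, m.start());
            if (m.start() == 0) {
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(trimmed, last, trimmed.length());
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Leading newline plus a second declaration in the body: both trigger the error.
        String broken = "\n<?xml version=\"1.0\"?><doc><?xml version=\"1.0\"?><item/></doc>";
        String repaired = keepOnlyLeadingDeclaration(broken);
        // The repaired text can now go through SAX; a real handler would replace DefaultHandler.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(repaired)), new DefaultHandler());
        System.out.println(repaired);
    }
}
```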

Parsing a false XML using JAXB

I have a situation where XML (though it is not really XML data, but rather a tag-based custom data format) is sent from a third-party server. Because of that I can't change the format, and coordinating with the third party is pretty difficult. The markup looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<result>SUCCESS</result>
<req>
<?xml version="1.0" encoding="UTF-8"?>
<Secure>
<Message id="dfgdfdkjfghldkjfgh88934589345">
<VEReq>
<version>1.0.2</version><pan>3453243453453</pan>
<Merchant><acqBIN>433274</acqBIN>
<merID>3453453245</merID>
<password>342534534</password>
</Merchant>
<Browser></Browser>
</VEReq>
</Message>
</Secure>
</req>
<id>1906547421350020</id>
<trackid>f68fb35c-cbc2-468b-aaf8-7b3f399b709d</trackid>
<ci>6</ci>
Here I want only the values of the result, req, id, trackid and ci tags as the parse output. That is, after parsing I need req to contain everything inside its tags. One more point: the req tag has another XML document embedded in it as-is, not as CDATA. I can't parse it using JAXB.
Does anybody know a library that can parse all the content if I configure the available tags in a file, or any other way? I don't really need to convert them to an object; even a hashmap with the tag as key and the content as value is fine. But I prefer the POJO model (generating a class from this kind of XML).
Let me know if somebody can help me.
Make it well-formed XML first and then pass it to whatever tool you find suitable. JAXB is not bad, as it will ignore elements it does not know (apart from the root element).
And since most (if not all) tools expect well-formed XML anyway, you'll have to take care of turning your "false" XML into "true" XML first. I'd first try something like JTidy or JSoup and see if they help to make your non-well-formed XML well-formed.
If that does not work, I'd try to hack it on the lower-level SAX or StAX parsing. The XML you posted seems to suffer from two problems: no single root element and an XML declaration in the body. I think both problems can be addressed with some minimal parser hacking.
And I think there is a special place in hell for people who invent this type of non-well-formed XML. Damned to sit there and correct all the HTML documents on the Internet into valid XHTML by hand.
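The two problems named above can also be worked around at the text level rather than inside the parser. The following is a sketch under assumptions, not a tested solution for the real feed: it removes every XML declaration, wraps the rest in a synthetic root (the name "wrapper" is arbitrary), parses the result with DOM, and builds the tag-to-content map the question asks for. Nested markup, such as the content of req, is serialized back to a string. The class name is hypothetical, and repeated top-level tags would overwrite each other in the map.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class FragmentToMap {
    static Map<String, String> parse(String raw) throws Exception {
        // Problem 1: XML declarations in the body. Drop every declaration.
        String body = raw.replaceAll("<\\?xml\\s[^?]*\\?>", "");
        // Problem 2: no single root element. Wrap everything in a synthetic one.
        String wrapped = "<wrapper>" + body + "</wrapper>";

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));

        // Identity transformer, used only to turn nested elements back into text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        Map<String, String> result = new LinkedHashMap<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between top-level tags
            }
            StringBuilder value = new StringBuilder();
            for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    StringWriter w = new StringWriter();
                    serializer.transform(new DOMSource(c), new StreamResult(w));
                    value.append(w);
                } else if (c instanceof Text) {
                    value.append(c.getNodeValue());
                }
            }
            result.put(n.getNodeName(), value.toString().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<result>SUCCESS</result>\n<req>\n"
                + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Secure><Message id=\"m1\">"
                + "<version>1.0.2</version></Message></Secure>\n</req>\n<ci>6</ci>";
        parse(raw).forEach((tag, content) -> System.out.println(tag + " = " + content));
    }
}
```

Once the text is well-formed like this, it could equally be handed to JAXB with the synthetic root mapped to a class.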

Parse Ampersand in XML with Java's DOM XML API

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters an ampersand (&) while parsing a text node, it errors out. I am guessing that this is solvable by 1) escaping, 2) encoding, or 3) using a different parser.
I am reading an XML document that I don't have any control over, so I cannot precisely identify where the ampersand appears in the document every time I read it.
The answers I have seen to similar questions advise replacing the entity when parsing the XML, but I am not sure how I could do that, since the document doesn't even parse once the parser encounters the ampersand.
Any help will be appreciated.
As noted, the XML is malformed (oops!): all occurrences of & in XML (other than the token introducing a character entity [?]) must be encoded as &amp;.
Some solutions (which are basically just as described in the post!):
Fix the XML (at source, or in hack-it-up phase), or;
Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)
For the "hack-it-up" approach, consider a separate input stream -- see Working with Filter Streams -- that executes as a filter prior to the actual DOM parser: whenever a & is encountered (that is not part of a character entity) it "fixes it" by inserting &amp; into the stream. Of course, if the XML source didn't get basic encoding correct...
Happy coding.
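A simplified take on the "hack-it-up" idea is sketched here. Instead of a true filter stream it works on the whole text as a String (a plain substitution for brevity, not what the answer literally describes) and escapes every & that does not already start something shaped like a reference. The class and method names are invented. Known limits: it would also rewrite ampersands inside CDATA sections, it cannot tell a stray "R&D;" from a real entity, and named entities that XML does not predefine (for example &nbsp;) are left alone and would still fail in the parser.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AmpersandRepair {
    // An '&' that is NOT followed by "name;", "#digits;" or "#xhex;".
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    // Repairs the text first, then hands it to the ordinary DOM parser.
    static Document parse(String text) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(escapeBareAmpersands(text))));
    }

    public static void main(String[] args) throws Exception {
        String broken = "<note>Tom & Jerry &lt; AT&T</note>";
        System.out.println(escapeBareAmpersands(broken));
        System.out.println(parse(broken).getDocumentElement().getTextContent());
    }
}
```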
"I am reading an XML document that I dont have any control over".
No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.

How to match two XML strings?

I am getting some information in the form of XML.
Before using that XML I want to validate that all the information is present in it.
For this purpose I will have a master copy of the XML, against which I will match all incoming documents.
How can I do this?
It looks like a lot of work, but you could use XPath, depending on the size and structure of your XML. Take a look at http://www.w3schools.com/xpath/default.asp.
Also there is a really good starting tutorial here:
http://www.ibm.com/developerworks/library/x-javaxpathapi.html
And if you're willing to do validation through the use of your xsd (if there is one), look at ( XML Schema (XSD) validation tool? )
A common approach for validating XML is to define a schema (XSD or DTD). A parser can validate the input XML document against all constraints that are specified in the schema document.
This is a common way to make sure that certain elements are present and that certain values are within a specified range.
You can refer to the following link to see how an XML document can be validated against a DTD in Java:
http://www.roseindia.net/xml/dom/DOMValidateDTD.shtml
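For the XSD route, the standard javax.xml.validation API can do the check. The sketch below is only an illustration: the schema, the element names (order, id, customer) and the class name are made up, and in practice the schema would describe the structure of the "master copy".

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class XsdCheck {
    // Hypothetical schema: an <order> must contain an <id> and a <customer>, in that order.
    static final String SAMPLE_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
            + "<xs:element name=\"order\"><xs:complexType><xs:sequence>"
            + "<xs:element name=\"id\" type=\"xs:positiveInteger\"/>"
            + "<xs:element name=\"customer\" type=\"xs:string\"/>"
            + "</xs:sequence></xs:complexType></xs:element>"
            + "</xs:schema>";

    // Returns true if the document satisfies the schema, false if validation reports an error.
    static boolean isValid(String xsd, String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for both schema violations and input that is not well-formed.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid(SAMPLE_XSD, "<order><id>7</id><customer>Ann</customer></order>"));
        System.out.println(isValid(SAMPLE_XSD, "<order><id>7</id></order>"));
    }
}
```

The message of the caught SAXException usually names the offending element, which is useful if the caller needs to report what was missing rather than just a yes/no answer.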
