How to skip a certain pattern without reading it from a BufferedInputStream - Java

How can I skip a certain pattern without reading it? I have
The SAX parser is throwing "The processing instruction target matching "[xX][mM][lL]" is not allowed." How can I skip the next XML pattern (using mark() and reset()) and move the current position past it?

You can only use a SAX parser to read XML, and the error message you're receiving indicates that your data is not well-formed XML. Specifically, an XML document may have at most one XML declaration (<?xml ... ?>), and it may only appear at the very start of the file.
Before you can read this data via SAX, you'll have to repair it by fixing the problem with the XML declaration(s). You can do this manually in a text editor, or programmatically by reading the file as text rather than parsing it with SAX, because it is not yet XML.
For more details on how to eliminate XML declaration problems such as this, see the comprehensive answer given for this Stack Overflow question:
Error: The processing instruction target matching "[xX][mM][lL]" is not allowed
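As a rough illustration of the "read the file as text" repair described above, the following sketch keeps an XML declaration only when it is the first thing in the input and strips any later ones. The class and method names (XmlDeclarationCleaner, keepOnlyLeadingDeclaration) are hypothetical, and the sketch assumes the whole input fits in a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlDeclarationCleaner {
    // Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>.
    // The required whitespace after "xml" leaves <?xml-stylesheet ...?> untouched.
    private static final Pattern XML_DECL =
            Pattern.compile("<\\?xml\\s[^?]*\\?>", Pattern.CASE_INSENSITIVE);

    /** Keeps a declaration only if it starts the text; removes every other one. */
    static String keepOnlyLeadingDeclaration(String text) {
        String rest = text.trim(); // whitespace before the declaration is also an error
        String header = "";
        Matcher leading = XML_DECL.matcher(rest);
        if (leading.lookingAt()) {
            header = leading.group();
            rest = rest.substring(leading.end());
        }
        return header + XML_DECL.matcher(rest).replaceAll("");
    }
}
```

Note that if the input is really several XML documents concatenated together, removing the extra declarations is not enough: the result would have more than one root element, so the documents would still need to be split apart or wrapped in a single root before a SAX parser accepts them.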

Related

Parse a large XML file and get the duplicate attributes

I have a large XML file. It is structured like below:
...
<LexicalEntry id="tajaAhul_$axoS_1">
<Lemma partOfSpeech="n" writtenForm="تجاهُل شخْص"/>
<Sense id="tajaAhul_$axoS_1_<homaAl_$axoS_n1AR" synset="<homaAl_$axoS_n1AR"/>
<WordForm formType="root" writtenForm="جهل"/>
</LexicalEntry>
...
The file has been created automatically, so it may contain duplicate writtenForm values. I want to parse it with Java to check whether there really are duplicate writtenForm values and, if so, retrieve them. The more I read about parsing XML files in Java, the more confused I get! I found that for a large file I should use a SAX parser, but I am not familiar with all its functions and methods, and I also found that with a SAX parser I have to do all the work in some handler class.
Since you mentioned your XML is large, the best option to parse is the SAX parser as you have already found out. It's not as scary as you assume. It reads through your XML content and calls your "Handler" to handle what it "sees" in the XML. Your handler class will be the one that will 'capture' and structure the XML content. Because it reads 'through' your XML it doesn't consume memory to store the content of XML. There are many examples out there on SAX parsing but this could be a starter example. Good luck!
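A minimal sketch of such a handler for the duplicate check, under some assumptions not stated in the question: only the writtenForm attribute of <Lemma> elements is compared, and the names DuplicateWrittenFormFinder and findDuplicates are made up for illustration.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DuplicateWrittenFormFinder extends DefaultHandler {
    private final Set<String> seen = new HashSet<>();
    private final Set<String> duplicates = new LinkedHashSet<>();

    // Called by the parser for every opening tag it reads.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Assumption: only <Lemma> elements are checked; change the name to widen the check.
        if ("Lemma".equals(qName)) {
            String form = attrs.getValue("writtenForm");
            // Set.add returns false when the value was already present.
            if (form != null && !seen.add(form)) {
                duplicates.add(form);
            }
        }
    }

    /** Streams the document once and returns each writtenForm seen more than once. */
    static Set<String> findDuplicates(InputSource source) {
        DuplicateWrittenFormFinder handler = new DuplicateWrittenFormFinder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML input", e);
        }
        return handler.duplicates;
    }
}
```

For a file on disk, the InputSource can wrap a FileInputStream. Memory use then grows with the number of distinct written forms rather than with the size of the file.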

StAX does not read characters like "“" [duplicate]

I am trying to parse an XML document using a SAX parser but keep getting "XML document structures must start and end within the same entity." This is expected, as the XML document I get from another source won't be a proper one. But I don't want this exception to be raised: I would like to parse the document only until I find <myTag> in it, and I don't care whether the document has proper opening and closing entities.
Example:
<employeeDetails>
<firstName>xyz</firsName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>...
Here I don't care whether the document is valid or not, as that part is out of my hands. I would like to parse this document until I see <departName>, and not parse anything after that. Please suggest how to do this. Thanks.
You cannot use an XML parser to parse a file that does not contain well-formed XML. (It does not have to be valid, just well-formed. For the difference, read Well-formed vs Valid XML.)
By definition, XML must be well-formed, otherwise it is not XML. Parsers in general have to have some fundamental constraints met in order to operate, and for XML parsers, it is well-formedness.
Either repair the file manually first to be well-formed XML, or open it programmatically and parse it as a text file using traditional parsing techniques. An XML parser cannot help you unless you have well-formed XML.
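As one possible reading of "parse it as a text file", the sketch below uses a regular expression to pull out the text of the first occurrence of a given tag and ignores everything else, well-formed or not. The names TagTextExtractor and firstTagText are hypothetical, and the input is assumed to fit in a String.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {
    /** Returns the text between the first <tagName ...> and </tagName>, if both are present. */
    static Optional<String> firstTagText(String text, String tagName) {
        String name = Pattern.quote(tagName);
        // Opening tag (optionally with attributes), shortest possible content, closing tag.
        Pattern pattern = Pattern.compile(
                "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

This is plain text matching, not XML parsing: it does not decode entity references, and it knows nothing about comments, CDATA sections, or nested elements with the same name.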
BeautifulSoup in Python can handle incomplete XML really well.
I use it to parse the prefix of large XML files for preview.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a><b>foo</b><b>bar<','xml')
<?xml version="1.0" encoding="unicode-escape"?>\n<a><b>foo</b><b>bar</b></a>

Reading an XML that contains errors in Java

I have an XML file that contains many inner XML documents of the same type.
I use DOM to read them.
My question is this:
If I get an exception while reading one of the inner XML documents (for example, because of an extra "<" in a field),
how do I move on to the next one?
The code is written in Java using DOM and XMLStreamReader.

Parse Ampersand in XML with Java's DOM XML API

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters an ampersand (&) while parsing a text node, it errors out. I am guessing that this is solvable by 1) escaping, 2) encoding, or 3) using a different parser.
I am reading an XML document that I don't have any control over, so I cannot know in advance where the ampersand will appear in the document each time I read it.
The answers I have seen to similar questions advise replacing the entity when parsing the XML, but I am not sure how I would be able to do that, since the document doesn't even parse once the parser encounters the ampersand.
Any help will be appreciated.
As noted, the XML is malformed (oops!): all occurrences of & in XML (other than the token introducing an entity or character reference) must be encoded as &amp;.
Some solutions (which are basically just as described in the post!):
Fix the XML (at source, or in hack-it-up phase), or;
Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)
For the "hack-it-up" approach, consider a separate input stream -- see Working with Filter Streams -- that executes as a filter prior to the actual DOM parser: whenever a & is encountered (that is not part of a character entity) it "fixes it" by inserting &amp; into the stream. Of course, if the XML source didn't get basic encoding correct...
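A simplified sketch of that pre-filtering idea. For brevity it works on a String rather than as a true filter stream, and the names AmpersandRepair, escapeBareAmpersands, and parseLeniently are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AmpersandRepair {
    // An '&' that does not begin something shaped like &name; &#38; or &#x26;
    // (simplified: real entity names may contain a few more characters).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Replaces every bare ampersand with the &amp; entity reference. */
    static String escapeBareAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    /** Repairs the text first, then hands it to the regular DOM parser. */
    static Document parseLeniently(String text) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(escapeBareAmpersands(text))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Input is still not well-formed XML", e);
        }
    }
}
```

Two limitations are worth keeping in mind: an ampersand inside a CDATA section is also rewritten, which changes its content, and a reference to an undeclared entity such as &nbsp; is left alone and will still make the parser fail.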
Happy coding.
"I am reading an XML document that I dont have any control over".
No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.

Is it possible to use Apache Digester to filter dynamic xml leaf tags?

I've used Apache Digester before and loved the branch-based searching of XML tags.
Specifying a tag as
h\a\b\
is very intuitive.
Now I want to do an XML filtering project, but Apache Digester doesn't seem like it will work, simply because there is no way to get to the underlying XML tags. As the FAQ says:
How do I get some xml nested within a tag as a literal string?
It is frequently asked how some XML (esp. XHTML) nested within a document can be extracted as a string, eg to extract the contents of the "body" tag below as a string:
...some xml code...
If you can modify the above to wrap the desired text as a CDATA section then things are easy; Digester will simply treat that CDATA block as a single string:
...some xml code...
If this can't be done then you need to use a NodeCreateRule to create a DOM node representing the body tag and its children, then serialise that DOM node back to text.
Remember that Digester is just a layer on top of a standard XML parser, and standard XML parsers have no option to just stop parsing input at a specific element - unless it knows that the contents of that element is a block of characters (CDATA).
Is there something that uses the same pattern system that I can use to filter XML? My idea is to take the patterns given by the user, blacklist them, and copy everything else.
Or maybe there is a way to find the location of a match in Apache Digester (the location in the XML, not just the displayed text). That would be enough for me to copy the other text by keeping a copy of it around and skipping the matches.
Edit: I've since found out that XPath looks almost OK for doing this, but all the applications I found were for selecting something, not removing it. Do you have an example of this?
Never mind, managed to do it with XPath.
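The post does not show the XPath solution it ended up with. One way to approach it, sketched here under the assumption that the document fits in memory as a DOM, is to evaluate each blacklisted XPath expression, detach the matching nodes, and serialise what is left. The names XPathBlacklistFilter and removeMatches are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBlacklistFilter {
    /** Removes every node matched by any blacklisted XPath expression and returns the rest. */
    static String removeMatches(String xml, String... blacklist) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            for (String expression : blacklist) {
                NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                        .evaluate(expression, doc, XPathConstants.NODESET);
                for (int i = 0; i < hits.getLength(); i++) {
                    Node hit = hits.item(i);
                    // Attribute nodes have no parent node and are skipped in this sketch.
                    if (hit.getParentNode() != null) {
                        hit.getParentNode().removeChild(hit);
                    }
                }
            }
            // An identity transform serialises the remaining DOM back to text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked parser, XPath, and transformer exceptions are collapsed for brevity.
            throw new IllegalStateException("Could not filter the XML input", e);
        }
    }
}
```

For example, removeMatches(xml, "/h/a/b") drops every b element under h/a and keeps the rest of the document. XPath steps are separated by forward slashes.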
