Reading an xml that contains errors in JAVA

I have an XML file that contains many nested XML documents of the same type.
I use DOM to read them.
My question is this:
if I get an exception while reading one of the inner XML documents (for example, an extra "<" in a field),
how do I move on to the next one?
The code is written in Java using DOM and XMLStreamReader.

Related

Parse a large XML file and get the duplicate attributes

I have a large XML file. It is structured like below:
...
<LexicalEntry id="tajaAhul_$axoS_1">
<Lemma partOfSpeech="n" writtenForm="تجاهُل شخْص"/>
<Sense id="tajaAhul_$axoS_1_<homaAl_$axoS_n1AR" synset="<homaAl_$axoS_n1AR"/>
<WordForm formType="root" writtenForm="جهل"/>
</LexicalEntry>
...
The file was generated automatically, so it may contain duplicate writtenForm values. I want to parse it with Java to check whether there really are duplicates and, if so, to list them. The more I read about parsing XML files in Java, the more confused I get! I found that for a large file I should use a SAX parser, but I am not familiar with its functions and methods, and I also found that with a SAX parser all the work has to be done in a handler class.

Since you mentioned your XML is large, the best option is the SAX parser, as you have already found out. It's not as scary as you assume. It reads through your XML content and calls your "Handler" to handle what it "sees" in the XML. Your handler class is the one that 'captures' and structures the XML content. Because the parser streams 'through' your XML, it doesn't consume memory to store the whole document. There are many SAX parsing examples out there that can serve as a starting point. Good luck!
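To make the handler idea concrete, here is a minimal sketch using the SAX parser that ships with the JDK. It assumes that "duplicate" means the same writtenForm value appearing on more than one Lemma element; the class name DuplicateWrittenForms, the Lexicon root element and the sample values are made up for illustration, and the sample ids are simplified compared with the file in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DuplicateWrittenForms {

    /** Counts every writtenForm value seen on a Lemma element (assumption: only Lemma matters). */
    static class CountingHandler extends DefaultHandler {
        final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            String form = attrs.getValue("writtenForm");
            if ("Lemma".equals(qName) && form != null) {
                counts.merge(form, 1, Integer::sum);
            }
        }
    }

    /** Streams the document once and returns the values that occurred more than once. */
    static List<String> findDuplicates(InputSource source) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(source, handler);

        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : handler.counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data; a real run would pass new InputSource(new FileInputStream(...)).
        String xml = "<Lexicon>"
                + "<LexicalEntry id=\"e1\"><Lemma partOfSpeech=\"n\" writtenForm=\"alpha\"/></LexicalEntry>"
                + "<LexicalEntry id=\"e2\"><Lemma partOfSpeech=\"n\" writtenForm=\"beta\"/></LexicalEntry>"
                + "<LexicalEntry id=\"e3\"><Lemma partOfSpeech=\"v\" writtenForm=\"alpha\"/></LexicalEntry>"
                + "</Lexicon>";
        System.out.println(findDuplicates(new InputSource(new StringReader(xml))));
    }
}
```

Only the map of distinct values is kept in memory, not the document itself, which is the point of using SAX here. If WordForm values should be checked too, the element-name test in startElement is the place to change.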

modifying xml document using xml parsers?

I have an XML document stored in a database table. I need to fetch the XML, modify a few elements, and put it back in the database.
I am thinking of using JDOM or JAXB to modify the XML elements. Could you please suggest which one performs better?
Thanks!

JAXB and JDOM are completely different things. JAXB serializes Java objects into an XML format and vice versa. JDOM simply reads in the XML file and stores it in a DOM-like tree, which can then be used to modify the XML itself. So you are better off with JDOM.
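As a rough sketch of the "read into a tree, change it, write it back" workflow: the example below uses the W3C DOM API bundled with the JDK rather than JDOM, purely so that it runs without extra libraries. The JDOM version follows the same three steps with its own classes. The class name XmlColumnEditor and the order/status element names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlColumnEditor {

    /** Parses the XML string, replaces the text of every element named tagName, and serializes it again. */
    static String setElementText(String xml, String tagName, String newText) throws Exception {
        // 1. Build the in-memory tree from the string fetched from the database.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // 2. Modify the matching elements (assumes they hold plain text, not child elements).
        NodeList matches = doc.getElementsByTagName(tagName);
        for (int i = 0; i < matches.getLength(); i++) {
            matches.item(i).setTextContent(newText);
        }

        // 3. Serialize the tree back to a string that can be stored again.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String stored = "<order><status>pending</status><id>7</id></order>";
        System.out.println(setElementText(stored, "status", "shipped"));
    }
}
```

For a handful of element changes on a document that fits in memory, the cost is dominated by parsing and serializing, so the choice of tree library is unlikely to matter much for performance.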

JAXB is meant for cases where you have objects whose attribute values are stored in XML: you parse an XML document, it gives you Java objects, and you can then write these back.
That is quite a bit of work if you simply want to change some values. It also doesn't work with arbitrary XML files; JAXB has its own format, linked to your objects' definitions.
JDOM also creates objects, but they are XML objects such as Element and Attribute.
If you just want to change some values, why not read the XML file as plain text and use string operations to make your changes?
Or, if the modification is better described as a set of rules, use XSLT with a stylesheet processor.
Googling for XSLT and Java will give you tons of examples.

Conversion from one form of XML to another form of XML

I am trying to convert an XML file from one format (one set of tags) to another set of tags. When I print the data between the tags in the new file, it is repeated hundreds of times. The new tags come out fine, but the data is repeated many times. For example, if the sentence is "Hello", Hello is repeated many times.
I am using a SAX parser to read the old XML file and the Node class with appendChild to put the contents into the new file.
Kindly help me with this! I'll provide the code.
Thank you!

I recommend using XSLT rather than an XML parser to transform XML.
Having said that, there is most likely something wrong with your SAX callbacks to cause something like that to happen. I can't be more specific without seeing your actual code.
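For illustration, here is a small sketch of the XSLT route using the javax.xml.transform API from the JDK. The stylesheet copies everything unchanged except that it renames one element. The element names sentence and line and the class name TagRenamer are invented for this example, since the actual formats were not posted.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TagRenamer {

    // Identity transform plus one override: <sentence> becomes <line> (hypothetical tag names).
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "  <xsl:output omit-xml-declaration=\"yes\"/>"
          + "  <xsl:template match=\"@*|node()\">"
          + "    <xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
          + "  </xsl:template>"
          + "  <xsl:template match=\"sentence\">"
          + "    <line><xsl:apply-templates select=\"@*|node()\"/></line>"
          + "  </xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the stylesheet to the input XML and returns the transformed document as a string. */
    static String convert(String inputXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(inputXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert("<doc><sentence>Hello</sentence></doc>"));
    }
}
```

Each text node is copied exactly once by the identity template, so the repeated-text problem described in the question does not arise with this approach. More tag mappings can be added as further template rules.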

Parse Ampersand in XML with Java's DOM XML API

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters an ampersand (&) while parsing a text node, it errors out. I am guessing that this is solvable by 1) escaping, 2) encoding, or 3) using a different parser.
I am reading an XML document that I don't have any control over, so I cannot know in advance where the ampersand will appear each time I read it.
The answers I have seen to similar questions advise replacing the entity when parsing the XML, but I am not sure how I would do that, since the document doesn't even parse once the parser hits the ampersand.
Any help will be appreciated.

As noted, the XML is malformed (oops!): all occurrences of & in XML (other than the token introducing a character entity [?]) must be encoded as &amp;.
Some solutions (which are basically just as described in the post!):
Fix the XML (at source, or in hack-it-up phase), or;
Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)
For the "hack-it-up" approach, consider a separate input stream (see Working with Filter Streams) that executes as a filter prior to the actual DOM parser: whenever an & is encountered (that is not part of a character entity) it "fixes it" by inserting &amp; into the stream. Of course, if the XML source didn't get basic encoding correct...
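A simplified sketch of that pre-filter idea follows. Instead of a true filter stream it reads the whole document into a string and repairs it with a regular expression before handing it to the DOM parser, which is easier to show but only suitable for inputs that fit in memory. The class and method names are hypothetical, and the pattern is an approximation: it would also rewrite a bare & inside CDATA sections or comments, where it is actually legal.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AmpersandRepair {

    // Matches an '&' that does NOT start a named or numeric reference such as &amp;, &#38; or &#x26;.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Escapes every bare ampersand and leaves existing references untouched. */
    static String escapeBareAmpersands(String raw) {
        return BARE_AMPERSAND.matcher(raw).replaceAll("&amp;");
    }

    /** Repairs the text first, then parses it with the standard DOM parser. */
    static Document parseLeniently(String raw) throws Exception {
        String repaired = escapeBareAmpersands(raw);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(repaired)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseLeniently("<a>Tom & Jerry &amp; Co</a>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

This only papers over one kind of error. If the source also emits stray "<" characters or mismatched tags, a regex pre-pass will not be enough, and fixing the producer remains the cleaner solution.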
Happy coding.

"I am reading an XML document that I dont have any control over".
No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.

Is it possible to use Apache Digester to filter dynamic xml leaf tags?

I've used Apache Digester before and loved its path-based matching of XML tags.
Specifying a tag as
h/a/b
is very intuitive.
Now I want to do an XML filtering project, but Apache Digester doesn't seem like it will work, simply because there is no way to get at the underlying XML tags. As the FAQ says:
How do I get some xml nested within a tag as a literal string?
It is frequently asked how some XML (esp. XHTML) nested within a document can be extracted as a string, eg to extract the contents of the "body" tag below as a string:
...some xml code...
If you can modify the above to wrap the desired text as a CDATA section then things are easy; Digester will simply treat that CDATA block as a single string:
...some xml code...
If this can't be done then you need to use a NodeCreateRule to create a DOM node representing the body tag and its children, then serialise that DOM node back to text.
Remember that Digester is just a layer on top of a standard XML parser, and standard XML parsers have no option to just stop parsing input at a specific element - unless it knows that the contents of that element is a block of characters (CDATA).
Is there something that uses the same pattern system that I can use to filter XML? My idea is to take the patterns given by the user, blacklist them, and copy everything else.
Or maybe there is a way to find the location of a match in Apache Digester (the location in the XML, not just the displayed text). That would be enough for me to copy the remaining text by keeping a copy of it around and skipping the matches.
Edit: I've since found out that XPath looks almost right for this, but all the examples I found were for selecting something, not removing it. Do you have an example of this?

Never mind, managed to do it with XPath.
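The poster did not share the solution, so the following is only a guess at what such an XPath-based filter might look like, using the javax.xml.xpath API from the JDK: each blacklisted expression is evaluated, the matching nodes are detached from the tree, and whatever remains is serialized. The class name XPathFilter and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathFilter {

    /** Removes every node matched by a blacklisted XPath expression and returns the remaining XML. */
    static String removeMatches(String xml, List<String> blacklist) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        for (String expression : blacklist) {
            NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                Node hit = hits.item(i);
                // Attribute nodes have no parent node and are skipped in this sketch.
                if (hit.getParentNode() != null) {
                    hit.getParentNode().removeChild(hit);
                }
            }
        }

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<h><a><b>secret</b><c>keep</c></a></h>";
        System.out.println(removeMatches(xml, Arrays.asList("/h/a/b")));
    }
}
```

The XPath expression /h/a/b plays the same role as a Digester pattern for that path, so user-supplied patterns map over fairly directly. Note that this loads the whole document into memory, unlike Digester's streaming model.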
