Navigating an XML file while keeping track of order - java

I need to convert an XML file to the IOB format.
The XML file represents the structure of a paper written in LaTeX, i.e. with sections and subsections. In this representation, sections are encoded as BODY, then I have a HEADER and then paragraphs or subsections.
Example:
<DIV DEPTH="1">
<HEADER ID="H-8"> Practical Results </HEADER>
<P TYPE="TXT">
<S ID="S-56" TYPE="TXT"> To assess its performance , <REF REFID="R-12" ID="C-36">Grover et al. 1993</REF> tried various methods . </S>
<S ID="S-57" TYPE="TXT"> The grammar is defined in metagrammatical formalism which is compiled into a unification-based ` object grammar ' -- a syntactic variant of the Definite Clause Grammar formalism <REF REFID="R-21" ID="C-37">Pereira and Warren 1980</REF> -- containing 84 features and 782 phrase structure rules . </S>
<DIV DEPTH="2">
<HEADER ID="H-9"> Comparing the Parsers </HEADER>
<P TYPE="TXT">
<S ID="S-61" TYPE="TXT"> In the first experiment , the ANLT grammar was loaded and a set of sentences was input to each of the three parsers . </S>
</P>
<IMAGE ID="I-0"/>
</DIV>
What I want to do is to keep all the text, but convert it to a different format, i.e. I want to remove the BODY structure, and just tag the HEADERs and the text part like this:
Practical/B-Header Results/I-Header ./O
To/B-Text assess/I-Text its/I-Text performance/I-Text ,/I-Text Grover/I-Text et/I-Text al./I-Text tried/I-Text various/I-Text methods/I-Text ./O
The/B-Text grammar/I-Text ... ./O
And so on.
I know some DOM parsing in Java (for example, I have been using jdom2 for a little while) but I do not know how to keep the order of the text: for example, I want to fetch the content of the REF tag (which is inside S, look at the example), but the text from its parent extends before and after the REF tag.
Any pointers? Should be fairly simple, but searches like "strip XML tags after certain depth" did not help me :-(

I would use an event-based XML parser such as StAX or SAX. Then you can keep track of levels, order and other things as you process each tag.
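A minimal sketch of that event-based approach, assuming the StAX API bundled with the JDK (javax.xml.stream). The class name IobConverter and the tagging rules are assumptions made for illustration: tokens are split on whitespace, the first token gets B-, later tokens get I-, and a final "." token gets O. Text is appended in the order the events arrive, so the words before, inside and after a REF element stay in sequence. Unlike the desired output in the question, this sketch keeps the year inside REF and does not append "./O" to headers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class IobConverter {

    /** Returns one IOB-tagged line per HEADER or S element, in document order. */
    static List<String> convert(String xml) {
        List<String> lines = new ArrayList<>();
        StringBuilder buffer = null; // non-null only while inside HEADER or S
        String label = null;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("HEADER") || name.equals("S")) {
                        buffer = new StringBuilder();
                        label = name.equals("HEADER") ? "Header" : "Text";
                    }
                    // Other tags (REF, P, DIV, ...) are ignored; their text is still collected.
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    if (buffer != null) {
                        buffer.append(reader.getText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if (buffer != null && (name.equals("HEADER") || name.equals("S"))) {
                        String text = buffer.toString().trim();
                        if (!text.isEmpty()) {
                            lines.add(tag(text, label));
                        }
                        buffer = null;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return lines;
    }

    /** Tags whitespace-separated tokens: B- for the first, I- for the rest, O for a final ".". */
    private static String tag(String text, String label) {
        String[] tokens = text.split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            String suffix;
            if (i == tokens.length - 1 && tokens[i].equals(".")) {
                suffix = "O";
            } else {
                suffix = (i == 0 ? "B-" : "I-") + label;
            }
            if (i > 0) {
                out.append(' ');
            }
            out.append(tokens[i]).append('/').append(suffix);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<DIV DEPTH=\"1\"><HEADER ID=\"H-8\"> Practical Results </HEADER>"
                + "<P TYPE=\"TXT\"><S ID=\"S-56\" TYPE=\"TXT\"> To assess its performance , "
                + "<REF REFID=\"R-12\" ID=\"C-36\">Grover et al. 1993</REF>"
                + " tried various methods . </S></P></DIV>";
        convert(xml).forEach(System.out::println);
    }
}
```

Because the parser never builds a tree, nesting depth does not matter here: a DIV inside a DIV simply produces more HEADER and S events in reading order.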

Related

XStream - How can I treat HTML tags as content, rather than fields?

I've got some XML that contains HTML tags in the content of a particular field, abstract, for example:
<abstract>
<p>
Kringles are autonomous structural domains, found throughout the blood clotting and fibrinolytic proteins. Kringle domains are believed to play a role in binding mediators (e.g., membranes, other proteins or phospholipids), and in the regulation of proteolytic activity [<cite idref="PUB00002414"/>, <cite idref="PUB00001541"/>, <cite idref="PUB00003257"/>].
</p>
<p>
Kringle domains [<cite idref="PUB00003400"/>, <cite idref="PUB00000803"/>, <cite idref="PUB00001620"/>] are characterised by a triple loop, 3-disulphide bridge structure, whose conformation is defined by a number of hydrogen bonds and small pieces of anti-parallel beta-sheet. They are found in a varying number of copies in some plasma proteins including prothrombin and urokinase-type plasminogen activator, which are serine proteases belonging to MEROPS peptidase family S1A.
</p>
</abstract>
I can't for the life of me figure out how to tell XStream to simply ignore these HTML and cite tags (not "ignore" in the sense of "omit," which is what XStream.ignoreUnknownElements() or XStream.omitField() do).
I'd like to just pull the entire abstract content into a string, including the HTML and cite tags.
Appreciate any suggestions!!!

Compare two XML attributes with same name in two different XML files using Java

I have an XML file with attributes and elements/tags.
I want to know whether an attribute or a tag is the better choice for performance.
Could you please give an example of the comparison both when the content is a child tag and when it is an attribute?
My question is: is it possible to compare two attributes with the same name in two different XML files? We will also have a huge amount of data.
So I want to know how the performance differs depending on whether I model the value as an attribute or as a tag.
<A Name="HRMS">
<B BName="IN">
<C Code="0001">
<IN irec="200" />
<OUT orec="230" Number="" Outname=""/>
</C>
<C Code="0004">
<IN irec="209" />
<OUT orec="209" Number="" Outname=""/>
</C>
<C Code="0008">
<IN irec="250" />
<OUT orec="250" Number="" Outname=""/>
</C>
</B>
</A>
Here, I have to compare irec with orec for a particular B name and C code.
It's possible. You need a Java library such as jsoup, which lets you select nodes with CSS-style selector expressions, much like jQuery.
jsoup is primarily an HTML parser, but it also has an XML parser mode, so you can use it to parse XML content.
jsoup example:
// Requires org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.parser.Parser and org.jsoup.select.Elements
String xml = "<root><person name=\"Bob\"><age>20</age></person></root>";
// Use the XML parser so that jsoup does not apply HTML rules to the input
Document root = Jsoup.parse(xml, "", Parser.xmlParser());
System.out.println(root.outerHtml()); // original XML content
Elements persons = root.getElementsByTag("person");
Element person = persons.first();
System.out.println("The attribute 'name' of Person: " + person.attr("name"));
System.out.println(persons.select("person[name=Bob]").first().text());
With jsoup, implementing the comparison is simple.
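For comparison, here is a rough sketch of the irec/orec check using only the DOM and XPath APIs that ship with the JDK instead of jsoup. The class name RecComparer and the method sameRecords are made up for this example, and the XML is a shortened copy of the sample in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class RecComparer {

    /** Returns true if IN/@irec equals OUT/@orec under the given B name and C code. */
    static boolean sameRecords(String xml, String bName, String code) {
        try {
            // Parse once, then run both queries against the in-memory tree.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // For brevity the values are concatenated into the path; they must not contain quotes.
            String base = "/A/B[@BName='" + bName + "']/C[@Code='" + code + "']";
            String irec = xpath.evaluate(base + "/IN/@irec", doc);
            String orec = xpath.evaluate(base + "/OUT/@orec", doc);
            // A missing attribute evaluates to the empty string, which never counts as a match.
            return !irec.isEmpty() && irec.equals(orec);
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not read or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<A Name=\"HRMS\"><B BName=\"IN\">"
                + "<C Code=\"0001\"><IN irec=\"200\"/><OUT orec=\"230\"/></C>"
                + "<C Code=\"0004\"><IN irec=\"209\"/><OUT orec=\"209\"/></C>"
                + "</B></A>";
        System.out.println(sameRecords(xml, "IN", "0001")); // false
        System.out.println(sameRecords(xml, "IN", "0004")); // true
    }
}
```

To compare across two files, the same path can be evaluated against two separately parsed documents. For very large inputs a DOM tree may not fit in memory, in which case a streaming parser is the usual alternative.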

Parsing XML files to get particular text content

I am parsing XML files that represent research papers/articles and follow the schema below, in order to store them in a MySQL database from Java.
<article>
<article-meta></article-meta>
<body>
<p>
Extensible Markup Language (XML) is a markup language that defines a set of
rules for encoding documents in a format that is both human-readable and machine-
readable <ref id = 1>. It is defined in the XML 1.0 Specification produced by the
W3C, and several other related specifications
</p>
<p>
Many application programming interfaces (APIs) have been developed to aid
software developers with processing XML <ref id = 2>. data, and several schema
systems exist to aid in the definition of XML-based languages.
</p>
</body>
<back>
<ref-list>
<ref id = 1>Details about this reference </ref>
<ref id = 2>Details about this reference </ref>
</ref-list>
</back>
</article>
I am parsing the files using a DOM parser. One of the requirements is that, for every ref id, I have to extract 150 characters to the left and to the right of the location where it is referenced in the body. How can I do this?
refId    leftText                   rightText
1        150 chars on left side     150 chars on right side
Assuming you have already read the <ref> element with id = 1 and content "Details about this reference" from the XML using DOM, store that content in a String variable; you can then use the substring method to get the left and right characters like this.
String text = "Details about this reference";
String leftText = text.substring(0, 7); // first 7 chars from the left side
String rightText = text.substring(text.length() - 2); // last 2 chars from the right side; replace 2 with the count you need (e.g. 150)
Result:
leftText: Details rightText: ce
Note: check that the string is at least 150 characters long before extracting; if it is shorter, substring throws StringIndexOutOfBoundsException.
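The same bounds check can be applied to the text that surrounds a citation in the body, which is what the question asks for. The following is only a sketch under assumptions: the input is a well-formed variant of the question's file in which citations are empty elements such as <ref id="1"/> inside <p>, and the names RefContext and around are invented for this example. Math.max and Math.min clamp the indices so that short paragraphs do not cause an exception.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class RefContext {

    /**
     * Returns {left, right}: up to n characters of paragraph text on each side
     * of the first citation with the given id, or null if there is none.
     */
    static String[] around(String xml, String refId, int n) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the XML", e);
        }
        NodeList refs = doc.getElementsByTagName("ref");
        for (int i = 0; i < refs.getLength(); i++) {
            Element ref = (Element) refs.item(i);
            // Only look at citations inside a <p>, not the entries in <ref-list>.
            boolean inParagraph = "p".equals(ref.getParentNode().getNodeName());
            if (!inParagraph || !refId.equals(ref.getAttribute("id"))) {
                continue;
            }
            // Walk the siblings in both directions to rebuild the text in reading order.
            StringBuilder left = new StringBuilder();
            for (Node s = ref.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                left.insert(0, s.getTextContent());
            }
            StringBuilder right = new StringBuilder();
            for (Node s = ref.getNextSibling(); s != null; s = s.getNextSibling()) {
                right.append(s.getTextContent());
            }
            String l = left.toString();
            String r = right.toString();
            return new String[] {
                l.substring(Math.max(0, l.length() - n)), // last n chars, or fewer
                r.substring(0, Math.min(n, r.length()))   // first n chars, or fewer
            };
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<article><body><p>processing XML <ref id=\"2\"/> data, and several"
                + " schema systems exist</p></body><back><ref-list>"
                + "<ref id=\"2\">Details about this reference</ref></ref-list></back></article>";
        String[] ctx = around(xml, "2", 150);
        System.out.println("left:  " + ctx[0]);
        System.out.println("right: " + ctx[1]);
    }
}
```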

java,Xml parsing

I want to parse an XML file like the one below in Java.
I know that we can parse an XML file using SAX or DOM.
As far as I know, if the XML looks like this
<XML><FORM><ITEM>Name</ITEM> <ITEM>Area</ITEM> <ITEM>ZipCode</ITEM> </FORM></XML>
we can parse it.
How can I get other properties such as Label, Title and Type, as in the XML file below?
How do I do that?
Please help me!!!
<XML><FORM TITLE="Search" View="1"><ITEM Label="Name" Type="Alpha" maxWidth="25"ID="name" Align="LEFT"></ITEM> <ITEM Label="Area " Type="AlphaNumeric" maxWidth="20" ID="area" Align="LEFT"></ITEM> <ITEM Label="Zip Code" Type="Numeric" maxWidth="10" ID="zip" Align="LEFT"></ITEM><ITEM Label="Search within radius of" Type="Radio" maxWidth="20" ID="ID" Align=" CENTER"><LIST_VALUES><ID="20" VALUE="20 kms"<ID="50" VALUE="50 kms"> <ID="100" VALUE="20 kms"></LIST_VALUES></ITEM></FORM><XML>
TITLE, Label, etc. are ATTRIBUTES of the various XML nodes. Depending on the exact API you're using, you'd generally find a method named something like getAttribute(String name), which lets you retrieve the content of a particular attribute.
I remember JDOM allowing you to call getAttribute(..) on an instance of Element, which maps to an XML node.
Here is an example using DOM, once you have your ITEM element (for instance):
String labelValue= itemElement.getAttribute("Label");
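A slightly fuller sketch of the same call, using the DOM parser bundled with the JDK. The input is a cleaned-up, well-formed fragment of the question's form (the original as posted is not well-formed), and the class name ItemAttributes is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ItemAttributes {

    /** Returns "Label (Type)" for every ITEM element, in document order. */
    static List<String> labels(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the XML", e);
        }
        List<String> result = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("ITEM");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // getAttribute returns an empty string when the attribute is absent.
            result.add(item.getAttribute("Label") + " (" + item.getAttribute("Type") + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<FORM TITLE=\"Search\" View=\"1\">"
                + "<ITEM Label=\"Name\" Type=\"Alpha\" maxWidth=\"25\" ID=\"name\"/>"
                + "<ITEM Label=\"Zip Code\" Type=\"Numeric\" maxWidth=\"10\" ID=\"zip\"/>"
                + "</FORM>";
        labels(xml).forEach(System.out::println);
    }
}
```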

parsing XML that contain XML in elements, Can this be done

I have a 'complex item' in XML, and a 'workitem' (also in XML) that contains lots of other info. I would like the workitem to contain a string that holds the complex item's XML.
for example:
<inouts name="ClaimType" type="complex" value="<xml string here>"/>
However, with SAX and other Java parsers I cannot get this line to parse: the parser doesn't like the < or the " characters in the string. I have tried escaping, and converting the " to '.
Is there any way around this at all? Or will I have to come up with another solution?
Thanks
I think you'll find that the XML you're dealing with won't parse with a lot of parsers since it's invalid. If you have control over the XML, you'll at a bare minimum need to escape the attribute so it's something like:
<inouts name="ClaimType" type="complex" value="&lt;xml string here&gt;" />
Then, once you've extracted the attribute you can possibly re-parse it to treat it as XML.
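A small sketch of that extract-then-re-parse step, assuming the JDK's built-in DOM parser. The class NestedXml, its methods and the embedded claim document are made up for illustration. The parser unescapes the attribute value when it reads it, so the extracted string is ordinary XML text that can be parsed a second time.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NestedXml {

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the XML", e);
        }
    }

    /** Returns the root element's value attribute, already unescaped by the parser. */
    static String embedded(String outerXml) {
        return parse(outerXml).getDocumentElement().getAttribute("value");
    }

    public static void main(String[] args) {
        // The inner document is escaped: < as &lt;, > as &gt; and " as &quot;.
        String outer = "<inouts name=\"ClaimType\" type=\"complex\" value=\""
                + "&lt;claim id=&quot;7&quot;&gt;&lt;amount&gt;100&lt;/amount&gt;&lt;/claim&gt;"
                + "\"/>";
        String inner = embedded(outer);
        System.out.println(inner);
        // Second pass: treat the extracted string as a document of its own.
        Document claim = parse(inner);
        System.out.println(claim.getDocumentElement().getTagName());
    }
}
```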
Alternatively, you can take one of the approaches above (using CDATA sections) with some re-factoring of your XML.
If you don't have control over your XML, you could try using the TagSoup library to parse it to see how you go. (Disclaimer: I've only used TagSoup for HTML, I have no idea how it'd go with non-HTML content)
(The tag soup site actually appears down ATM, but you should be able to find enough doco on the web, and downloads via the maven repository)
Possibly the easiest solution would be to use a CDATA section. You could convert your example to look like this:
<inouts name="ClaimType" type="complex">
<![CDATA[
<xml string here>
]]>
</inouts>
If you have more than one attribute you want to store complex strings for, you could use multiple child elements with different names:
<inouts name="ClaimType" type="complex">
<value1>
<![CDATA[
<xml string here>
]]>
</value1>
<value2>
<![CDATA[
<xml string here>
]]>
</value2>
</inouts>
Or multiple value elements with an identifying id:
<inouts name="ClaimType" type="complex">
<value id="complexString1">
<![CDATA[
<xml string here>
]]>
</value>
<value id="complexString2">
<![CDATA[
<xml string here>
]]>
</value>
</inouts>
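Reading such CDATA values back could look roughly like this with the JDK's DOM parser. The class name CdataValues and the sample payloads are assumptions for this sketch. getTextContent returns the CDATA content as plain text, including the line breaks around it, hence the trim.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CdataValues {

    /** Returns the trimmed text of the value element with the given id, or null if absent. */
    static String value(String xml, String id) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the XML", e);
        }
        NodeList values = doc.getElementsByTagName("value");
        for (int i = 0; i < values.getLength(); i++) {
            Element value = (Element) values.item(i);
            if (id.equals(value.getAttribute("id"))) {
                // The markup inside CDATA is not parsed; it comes back as literal text.
                return value.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<inouts name=\"ClaimType\" type=\"complex\">\n"
                + "<value id=\"complexString1\">\n<![CDATA[\n<claim><amount>100</amount></claim>\n]]>\n</value>\n"
                + "<value id=\"complexString2\">\n<![CDATA[\n<other/>\n]]>\n</value>\n"
                + "</inouts>";
        System.out.println(value(xml, "complexString1"));
        System.out.println(value(xml, "complexString2"));
    }
}
```

One limitation to keep in mind: a CDATA section ends at the first "]]>", so an embedded string that itself contains that sequence needs to be split or escaped instead.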
CDATA section or escaping
NB There is a big difference between escaping and encoding, which some other posters have referred to. Be careful of confusing the two.
I'm not sure how it works for attributes; if escaping (< as &lt; and > as &gt;) does not work, then I don't know.
If it were an inner tag: you could use the Xml Any mechanism (never used it myself) or declare it in a CDATA section.
you are http://www.doingitwrong.com/
If inouts/@value really is tree-structured (i.e. XML) then it shouldn't be an attribute, it should be a child element:
<inout name="ClaimType" type="complex">
<value>
<some-arbitrary>
<xml-stuff/>
</some-arbitrary>
</value>
</inout>
If it is not, in fact, guaranteed to be well-formed XML, but just sort of looks like it because you put some pointy brackets in it, then you should ask yourself if there isn't some better way to solve this problem. That failing, use <![CDATA[ as some have already suggested.
