How to properly parse XML with SAX? - java

I am receiving an XML document from a REST service which has to be parsed using SAX. Please see the following example, which was generated from the XSD.
Setting up the parser is not a problem. My main problem is the actual processing in the startElement(), endElement() methods etc. I don't understand how to extract the items I need and store them as they are somewhat "nested".
Example
The ConnectionList can occur once or twice and may contain any number of Connection elements which, in turn, hold details about a connection. Basically, I need a list of all connections with their Date, Transfers and Time. Do I have to create one class per element?
As far as I understand it, I somehow need to do the following:
If the parser comes across a...
ConnectionList: Create new ConnectionList object and put it into a list of ConnectionLists
Connection: Create a new Connection object and put it into a list of Connections
Date, Transfers, Time (only if parent is Duration): Store the node value in the current Connection object
I'd really appreciate any help, hint, idea or snippet showing how I can achieve this.
Thanks :-)
Robert
<?xml version="1.0" encoding="UTF-8"?>
<ResC xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Err code="r5E5a1Wm" text="tk-gWYbw" level="E"/>
<Err code="takVDd34" text="XtvyjmjPuscK" level="E"/>
<Err code="hQ1-:aDQ" text="YWc5qtY.gkwCeJW2S" level="E"/>
<ConRes dir="R">
<Err code="ZfwPC:tj" text="RKKFuLXoM0oOfp3a" level="E"/>
<Err code="bhDjSJPa" text="BJoHuOMdwzhcddW" level="E"/>
<Err code="CX-NhK9r" text="j55qy-WiNPXu" level="E"/>
<ConResCtxt b="1" f="1">0815</ConResCtxt>
<ConnectionList type="IV">
<Err code="WI3WX.jo" text="rK3H5jwa-Zfen3" level="E"/>
<Connection id="ID000">
<Overview>
<Date>b3lcM_Yiyq7dqL9</Date>
<Departure>
<BasicStop type="NORMAL" index="-1086549314">
<Address externalId="t.EdKe93xkqFqLwPzgd-4vHSJemy8"
externalStationNr="1332105793" name="fdREYJPu83WV503V8szdCX"
x="951177990" y="-1579782776" z="1807457957" type="WGS84"/>
</BasicStop>
</Departure>
<Arrival>
<BasicStop type="NORMAL" index="1897526979">
<Address externalId="l7h_GTUit6fv" externalStationNr="-1670310329"
name="WJznDTzkTvyET51pfr7X" x="-1738098662" y="-170353174"
z="-475585957" type="WGS84"/>
</BasicStop>
</Arrival>
<Transfers>dZbgZfDH8j1hb1i</Transfers>
<Duration>
<Time>00d00:18:00</Time>
</Duration>
<ServiceDays> </ServiceDays>
<Products>
<Product cat="qmrN2dShHJp"/>
<Product cat="Hg"/>
<Product cat="nurxhdl3w.P0x7FRv2J3UoF"/>
</Products>
<ContextURL url="http://FzgEqiVC/"/>
</Overview>
</Connection>
<Connection id="ID004">
<Overview>
<Date>W5a47DRkc7XDZjhwq_s5Un.</Date>
<Departure>
<BasicStop type="NORMAL" index="-1014429844">
<Address externalId="RMnzjEFOTTdM1oaAUw" externalStationNr="1429101638"
name="HF-1" x="1005198487" y="570832676" z="975615566" type="WGS84"
/>
</BasicStop>
</Departure>
<Arrival>
<BasicStop type="NORMAL" index="-58308182">
<Address externalId="rVdwdQvAukfj2QcA7b3OSdGOyW"
externalStationNr="1142334006" name="g" x="-1791416159"
y="-541300941" z="478129823" type="WGS84"/>
</BasicStop>
</Arrival>
<Transfers>GG56XN6zgiJF804mE_N4o</Transfers>
<Duration> </Duration>
<ServiceDays> </ServiceDays>
<Products>
<Product cat="fs_Oyoy9NYBai-qaxbty6j9Y7r1St"/>
<Product cat="P2CbaSGpC"/>
<Product cat="CGZrqSIDM6M4kUlb8_xZ8jRlH4c"/>
</Products>
<ContextURL url="http://JkRhuXtu/"/>
</Overview>
</Connection>
</ConnectionList>
<ConnectionList type="IV">
<Err code="0lFWRY2X" text="KLmdczFRhV" level="E"/>
<Connection id="ID012">
<Overview>
<Date>t8mn634zjCZsRPyxj_e_-UYMH</Date>
<Departure>
<BasicStop type="NORMAL" index="-2095085423">
<Address externalId="ftKAFG-Uk7x" externalStationNr="1390920810"
name="JQrQXOQbm.FLaCMeSiTYjT" x="1970142849" y="-655980297"
z="2102464970" type="WGS84"/>
</BasicStop>
</Departure>
<Arrival>
<BasicStop type="NORMAL" index="1552118247">
<Address externalId="qcBpeuPDRzvSt1o" externalStationNr="-1133118359"
name="AJiJOB1t" x="-1422533132" y="-1158953133" z="484831466"
type="WGS84"/>
</BasicStop>
</Arrival>
<Transfers>D0MiUwW9nuuM_uykvawg2C07pwHL</Transfers>
<Duration> </Duration>
<ServiceDays> </ServiceDays>
<Products>
<Product cat="LpGOZbLDbJm"/>
<Product cat="JIv-szQVX2icPb"/>
<Product cat="Q7-pthWoOT"/>
</Products>
<ContextURL url="http://zGWgivvi/"/>
</Overview>
<IList>
<I header="ze4Wt3hVD-DvjujY6QKae" text="lVwB4RxAHcYq3.F"
uriCustom="iVjQJCoU1MVOv2Z9lwarP"/>
<I header="z-i.au59soMzXLZCbV" text="PoTP" uriCustom="ksrbwEH6scNR"/>
<I header="N" text="jHDA4" uriCustom="ub95811lMIa_495ZbPOuNWL0rRWh"/>
</IList>
<CommentList>
<Comment id="ID013">
<Text lang="EN"> </Text>
<Text lang="FR"> </Text>
<Text lang="PL"> </Text>
</Comment>
<Comment id="ID014">
<Text lang="DK"> </Text>
<Text lang="IT"> </Text>
<Text lang="IT"> </Text>
</Comment>
<Comment id="ID015">
<Text lang="MACRO"> </Text>
<Text lang="IT"> </Text>
<Text lang="EN"> </Text>
</Comment>
</CommentList>
</Connection>
</ConnectionList>
</ConRes>
</ResC>

The best way I've found (so far) of parsing XML using SAX is to use a stack and conditional statements in the relevant callbacks. Here's an article describing it, and my summary of it:
The basic premise is that as you parse the document, you create objects to store the parsed data, pushing them onto the stack as you go, peeking at the top of the stack to add data to the current element, and at the end of each element popping it off the stack and storing it in the parent.
The effect is that you parse the tree of elements depth first, and at the end of each branch you roll it back into the parent until you're left with a single object (such as your ConnectionList) that contains all of the parsed data ready to be used. Essentially, you end up with a series of objects that mirror the structure of the original XML.
That means you need some data objects that can store the data in the same structure as the XML. Complex elements will normally become classes, while simple elements will normally become fields within those classes. The root element is often represented by a list of some kind.
To start with, you create a stack object to hold the data as you parse it.
Then, at the start of each element you identify what type it is using localName.equals() (or qName.equals() when the parser is not namespace-aware, since localName is then empty), create an instance of the appropriate class, and push it onto the stack. If the element is a simple element, you will probably model it as a field in the class representing the parent element, and you will need a set of flags that tell the parser which such element has been encountered so it can be processed in the characters() method.
The actual data is read using the characters() method, and again you use conditional logic to determine what to do with the data, based on the value of the flag. Essentially, you peek at the top of the stack and use the appropriate method to write the data into the object, converting from text where necessary.
At the end of each element, you pop the top of the stack and use localName.equals() again to determine how to store it in the object below it (e.g. which setter method needs to be called).
When you reach the end of the document you should have captured all the data in the document.
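
A minimal sketch of the stack approach described above, applied to the ConnectionList/Connection structure from the question. The Connection class, its field names and the parse() helper are assumptions made for illustration. The sketch compares against qName because the JDK's default SAXParserFactory is not namespace-aware, and it assumes every Connection sits inside a ConnectionList. It does not check that Time has Duration as its parent.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ConnectionSaxSketch {

    /** Hypothetical data holder; the field names are assumptions. */
    static class Connection {
        String id, date, transfers, time;
    }

    static class Handler extends DefaultHandler {
        final List<List<Connection>> connectionLists = new ArrayList<>();
        private final Deque<Object> stack = new ArrayDeque<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for this element
            if (qName.equals("ConnectionList")) {
                stack.push(new ArrayList<Connection>());
            } else if (qName.equals("Connection")) {
                Connection c = new Connection();
                c.id = atts.getValue("id");
                stack.push(c);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            text.append(ch, start, length);
        }

        @SuppressWarnings("unchecked")
        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("Connection")) {
                // Roll the finished Connection into its parent list.
                Connection done = (Connection) stack.pop();
                ((List<Connection>) stack.peek()).add(done);
            } else if (qName.equals("ConnectionList")) {
                connectionLists.add((List<Connection>) stack.pop());
            } else if (stack.peek() instanceof Connection) {
                // Simple elements become fields of the Connection on top of the stack.
                Connection current = (Connection) stack.peek();
                String value = text.toString().trim();
                if (qName.equals("Date")) current.date = value;
                else if (qName.equals("Transfers")) current.transfers = value;
                else if (qName.equals("Time")) current.time = value;
            }
        }
    }

    static List<List<Connection>> parse(String xml) {
        try {
            Handler handler = new Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.connectionLists;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling ConnectionSaxSketch.parse(xml) returns one inner list per ConnectionList element. Elements that are not handled, such as Err or Products, are simply ignored.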

Your SAX event handler should act as a state machine. Your structure is pretty deep, so the state machine will be a bit complex; but this is the basic approach:
All variables are member variables.
When you encounter a startElement event, you instantiate an object representing that element and put it on a stack (or set a flag indicating which value you are working with).
When you encounter a text event (the characters() callback), read the text and set the appropriate value based on the flag you set in the previous step.
When you encounter an endElement event, you pull the current object off the stack and call the setter on the object that is now on the top of the stack.
When you exhaust the document, you should only have one object left on the stack which represents everything you've read.

SAX parsers are a bit like looking at a large picture through a tiny spy hole.
The callback presents you with a single piece of the XML structure at a time. It won't give you any clue as to where you are in the document; only a single piece of data is presented: the element name, the attribute name/value, or the text content.
Your program needs to track where you are in the document. If you are parsing on the fly, a simple stack structure will do: you push the element name onto the stack in startElement() and pop it in endElement().
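
As a rough sketch of that name stack, the handler below keeps the path of open elements and uses it to accept a Time element only when its parent is Duration, which is the condition from the question. The class and method names are made up for the example, and qName is used on the assumption that the parser is not namespace-aware (the JDK default).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PathTrackingSketch {

    static class Handler extends DefaultHandler {
        final List<String> times = new ArrayList<>();
        private final Deque<String> path = new ArrayDeque<>(); // names of open elements
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            path.push(qName);
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            path.pop();                  // remove the element that just ended
            String parent = path.peek(); // null once the root element has ended
            if (qName.equals("Time") && "Duration".equals(parent)) {
                times.add(text.toString().trim());
            }
        }
    }

    /** Hypothetical helper: returns the text of every Time whose parent is Duration. */
    static List<String> durationTimes(String xml) {
        try {
            Handler handler = new Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.times;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```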
If you find yourself building a tree structure, I would switch to a DOM parser, as whatever you write will be a pale and buggy shadow of something like Xerces.

If it's a reasonably small XML document and the memory/throughput constraints don't rule out an in-memory solution, then you could use JAXB instead. You can generate the required classes from the XSD and simply unmarshal the XML into Java objects. If you must use a streaming parser, then consider using StAX instead; I generally find it more intuitive.
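
For comparison, here is a small StAX sketch using the javax.xml.stream API that ships with the JDK. It only pulls out the text of every Date element; the class and method names are invented for the example, and a real reader for the document in the question would handle more elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSketch {

    /** Hypothetical helper: collects the text of every Date element. */
    static List<String> connectionDates(String xml) {
        List<String> dates = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            // With StAX the application pulls events instead of receiving callbacks.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("Date")) {
                    // getElementText() reads up to the matching end tag of a text-only element.
                    dates.add(reader.getElementText().trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return dates;
    }
}
```

Because the loop is ordinary code rather than a set of callbacks, state such as "the current Connection" can live in local variables instead of handler fields.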

Generally speaking you have a couple of choices:
1. Use custom objects to map the XML to; these objects will encapsulate further objects, much like the XML elements nest.
2. Do a generic parse and traverse the DOM via relative elements.
To my knowledge there are tools such as JAXB which will generate your classes based on XSDs, but they can sometimes come with a price, as generated code often does.
If you go with option 1 and "roll your own" you'll need to provide methods for unmarshaling and marshaling that go to and from XML and most likely Strings. Something like:
<Foo>
  <Bar>
    <Baz></Baz>
  </Bar>
  <Thing></Thing>
</Foo>

// pseudo-code!
// In Foo.java
unmarshal( Element element ) {
    unmarshalBar( element );
    unmarshalThing( element );
}
unmarshalBar( Element element ) {
    //...verify that the element is bar
    bar = new Bar();
    bar.unmarshal( element );
}
// In Bar.java
unmarshal( Element element ) {
    unmarshalBaz( element );
}
Hope this helps.

I usually put objects on a stack, and push/pop them while parsing the XML file (particularly useful if objects are nested, but that's not your case).
If you want a simpler approach, you need a pointer to the current ConnectionList and to the current Connection. Since you already know the structure of your file, this could be easier than using a stack-based parser.

Related

Search query in XML file from java

My project manager told me to move all the queries into an XML file (he even made it for me), so that when the user (via JSP) selects the description "Flusso VLT mensile" he can click search, update or download. The download works now, but I still need to get the filename. He told me to work with JAXB, but I don't think it is necessary.
<flow-monitor>
<menu1>
<item id="7" type="simple">
<connection name="VALSAP" />
<description value="Flusso VLT mensile" />
<filename value="flussoVltmensile" />
<select><![CDATA[
SELECT * FROM vlt_sap WHERE stato=7
]]>
</select>
<update>
<![CDATA[update vlt_sap set stato = 0 where stato =7]]>
</update>
</item>
<item id="11" type="simple">
<connection name="VALSAP" />
<description value="Flusso REPNORM BERSANI" />
<filename value="flussoRepnormBersani" />
<select><![CDATA[
select * from repnorm_bersani_sap where stato = 99
]]>
</select>
<update>
<![CDATA[update repnorm_bersani_sap set stato=0 where stato = 99]]>
</update>
</item>
</menu1>
</flow-monitor>
In Java I should read this XML and, depending on <description value=>, execute the query inside the matching item. Is there a way to read the values easily without writing a lot of if statements?
Does anybody know a good and easy way to achieve all this?
Thanks
There are a few ways to read the XML file and extract the information you need without using a lot of if statements. One approach is to use an XML parsing library such as JAXB or SAX, and create Java classes to represent the XML elements.
In JAXB, you can use the javax.xml.bind.Unmarshaller class to unmarshal the XML file into a Java object, which you can then traverse to extract the information you need.
You should start by creating Java classes based on the XML structure, like FlowMonitor, Menu1, Item, Connection etc., and use annotations to map the XML elements to the fields.
Then, you can use the unmarshaller.unmarshal() method to parse the XML file and create an instance of the FlowMonitor class, which will contain all the information from the XML file.
Once you have the FlowMonitor object you can loop through the items, and get the description and filename by calling getDescriptionValue() and getFilenameValue() of the item object....

XML parsing:Retrieve multiple rows in xml using digester

While parsing an XML file like the one below, I want to get the list of telephone numbers for one particular id. I am using Digester to do this, but I don't understand how to add the call methods or create objects. Can anyone help me with this? My XML file contains 1000's of types
<?xml version='1.0' encoding='utf-8'?>
<address-book>
<contact type="individual">
<id>50</id>
<city>New York</city>
<province>NY</province>
<postalcode>10013</postalcode>
<country>USA</country>
<address>
<telephone>1-212-345-6789</telephone>
<telephone>1-212-345-6789</telephone>
<telephone>1-212-345-6789</telephone>
<telephone>1-212-345-6789</telephone>
</address>
</contact>
<contact type="business">
<id>52</id>
<city>Zagreb</city>
<province></province>
<postalcode>10000</postalcode>
<country>Croatia</country>
<address>
<telephone>1-212-345-6789</telephone>
<telephone>1-212-345-6789</telephone>
<telephone>1-212-345-6789</telephone>
<telephone>1-212-345-6789</telephone>
</address>
</contact>
Also, how should I stop the parsing once I have found the required id?
Although the question was specific to apache-commons-digester, this can be solved with the XML libraries that are already available, namely a SAX parser coupled with an XPath search. Instead of brute-forcing through the data, if you know what you are searching for, an XPath query can find it relatively efficiently. Otherwise, if you are traversing the entire data set for indexing or other purposes, I would again recommend using a simple SAX parser and looping through the elements (possibly via an //MyElement type XPath query), passing each value to a function for indexing or whatever operation is needed. The apache-commons-digester may be overly complicated and/or slow for what is needed.
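
A minimal sketch of the XPath part, using the JDK's javax.xml.xpath package. Note that this API evaluates the expression over an in-memory tree built from the input rather than over a SAX stream, so it suits files that fit in memory. The expression assumes the address-book/contact/address/telephone layout shown in the question; the class and method names are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TelephoneXPathSketch {

    /** Hypothetical helper: returns all telephone numbers of the contact with the given id. */
    static List<String> telephonesFor(String xml, int id) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select only the telephones of the contact whose <id> child matches.
            String expression = "/address-book/contact[id='" + id + "']/address/telephone";
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```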

Navigating an XML file while keeping track of order

I need to convert an XML file into the IOB format.
The XML file represents the structure of a paper written in LaTeX, i.e. with sections and subsections. In this representation, sections are encoded as BODY, then I have a HEADER and then paragraphs or subsections.
Example:
<DIV DEPTH="1">
<HEADER ID="H-8"> Practical Results </HEADER>
<P TYPE="TXT">
<S ID="S-56" TYPE="TXT"> To assess its performance , <REF REFID="R-12" ID="C-36">Grover et al. 1993</REF> tried various methods . </S>
<S ID="S-57" TYPE="TXT"> The grammar is defined in metagrammatical formalism which is compiled into a unification-based ` object grammar ' -- a syntactic variant of the Definite Clause Grammar formalism <REF REFID="R-21" ID="C-37">Pereira and Warren 1980</REF> -- containing 84 features and 782 phrase structure rules . </S>
<DIV DEPTH="2">
<HEADER ID="H-9"> Comparing the Parsers </HEADER>
<P TYPE="TXT">
<S ID="S-61" TYPE="TXT"> In the first experiment , the ANLT grammar was loaded and a set of sentences was input to each of the three parsers . </S>
</P>
<IMAGE ID="I-0"/>
</DIV>
What I want to do is to keep all the text, but convert it to a different format, i.e. I want to remove the BODY structure, and just tag the HEADERs and the text part like this:
Practical/B-Header Results/I-Header ./O
To/B-Text assess/I-Text its/I-Text performance/I-Text ,/I-Text Grover/I-Text et/I-Text al./I-Text tried/I-Text various/I-Text methods/I-Text ./O
The/B-Text grammar/I-Text ... ./O
And so on.
I know some DOM parsing in Java (for example, I have been using jdom2 for a little while) but I do not know how to keep the order of the text: for example, I want to fetch the content of the REF tag (which is inside S, look at the example), but the text from its parent extends before and after the REF tag.
Any pointers? Should be fairly simple, but searches like "strip XML tags after certain depth" did not help me :-(
I would use an event-based XML parser like StAX, SAX etc. Then you can keep track of levels, order and other things as you process each tag.
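
To illustrate why an event-based parser keeps the order: SAX delivers the text before a REF, the text inside it, and the text after it as consecutive characters() calls. The sketch below only collects the text of each S element in document order; the class and method names are assumptions, and the IOB tagging itself is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class OrderedTextSketch {

    static class Handler extends DefaultHandler {
        final List<String> sentences = new ArrayList<>();
        private StringBuilder current; // non-null only while inside an S element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("S")) {
                current = new StringBuilder();
            }
            // Nested tags such as REF need no action: their text arrives via characters().
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("S")) {
                // Collapse runs of whitespace so tokens can later be split on single spaces.
                sentences.add(current.toString().trim().replaceAll("\\s+", " "));
                current = null;
            }
        }
    }

    /** Hypothetical helper: returns the text of each S element, nested tags stripped. */
    static List<String> sentences(String xml) {
        try {
            Handler handler = new Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.sentences;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```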

How to get data with tag name & their values inside parent tag in xml

I am working on Java. I am parsing an xml file, I am getting tag values, it is working. I have xml file as follows:
<DOC>
<STUDENT>
<ID>1</ID>
<NAME>DAN</NAME>
<ADDRESS>U.K</ADDRESS>
</STUDENT>
<STUDENT>
<ID>2</ID>
<NAME>JACK</NAME>
<ADDRESS>U.S</ADDRESS>
</STUDENT>
</DOC>
My question is how to fetch the data inside <DOC>...</DOC> together with the tag names and their values. That is, I want the data as follows:
"<STUDENT>
<ID>1</ID>
<NAME>DAN</NAME>
<ADDRESS>U.K</ADDRESS>
</STUDENT>
<STUDENT>
<ID>2</ID>
<NAME>JACK</NAME>
<ADDRESS>U.S</ADDRESS>
</STUDENT>"
Please guide me how to do it.
The most common approaches in Java are to use either a SAX or a DOM parsing library.
If you look them up you should find loads of documentation/tutorials about them.
DOM is normally the easiest to use, as it stores the entire XML in memory and you can then access any tag. However, it is less performant and can be problematic if you are working with very large XML. SAX requires more work, but reads the XML and processes each tag as it gets to it.
Both are able to do what you need though.
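
A possible DOM-based sketch: parse the document, then serialize each child element of the root back to text with an identity Transformer. The class and method names are invented, and whitespace-only text between the child elements is dropped for simplicity.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnerXmlSketch {

    /** Hypothetical helper: returns the markup of all child elements of the root. */
    static String innerXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    identity.transform(new DOMSource(child), new StreamResult(out));
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract inner XML", e);
        }
    }
}
```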
Take a look at SAX Parser.
This link might be helpful too: http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/

How to create a custom Swing Document structure

I have a JEditorPane holding a custom EditorKit and a custom Document (derived from DefaultStyledDocument).
The following is an example for the content of the JEditorPane:
first paragraph
second paragraph
For the example above I get a document-structure with the following XML-equivalent:
<root>
<section>
<paragraph>
<content>first</content>
<content bold="true">paragraph</content>
</paragraph>
<paragraph>
<content>second paragraph</content>
<content>\n</content>
</paragraph>
</section>
</root>
Note that the tag names above are determined by the Element.getName() function.
My intention is, to extend this structure by custom element types to edit content other than styled text.
An example would be extending the editor to be a music-note editor to get an XML-structure like this:
<root>
<section>
<paragraph>
<content>first</content>
<content bold="true">paragraph</content>
</paragraph>
<musicnotes>
<bar>
<note>C</note>
<note>D</note>
<note>E</note>
</bar>
</musicnotes>
</section>
</root>
As I see it, the Style- and Paragraph-Elements are created upon Document.insertString() and Document.setCharacterAttributes() methods.
My problem is that I have no clue how to override these methods (or write counterparts) so that they do not fall back to the default structure but use custom element kinds instead.
In fact, I don't even know if this is the correct approach. Do I have to create my very own implementation of the Document interface to create a custom document structure?
See the example of table creation:
http://java-sl.com/JEditorPaneTables.html
You can use the same approach to define the desired structure.
