parsing large XML using SAX in java - java

I am trying to parse the Stack Overflow data dump. One of the tables is called posts.xml and has around 10 million entries in it. Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="26" CreationDate="2010-07-07T19:06:25.043" Score="10" ViewCount="1192" Body="<p>Now that the Engineer update has come, there will be lots of Engineers building up everywhere. How should this best be handled?</p>
" OwnerUserId="11" LastEditorUserId="56" LastEditorDisplayName="" LastEditDate="2010-08-27T22:38:43.840" LastActivityDate="2010-08-27T22:38:43.840" Title="In Team Fortress 2, what is a good strategy to deal with lots of engineers turtling on the other team?" Tags="<strategy><team-fortress-2><tactics>" AnswerCount="5" CommentCount="7" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="184" CreationDate="2010-07-07T19:07:58.427" Score="5" ViewCount="469" Body="<p>I know I can create a Warp Gate and teleport to Pylons, but I have no idea how to make Warp Prisms or know if there's any other unit capable of transporting.</p>
<p>I would in particular like this to built remote bases in 1v1</p>
" OwnerUserId="10" LastEditorUserId="68" LastEditorDisplayName="" LastEditDate="2010-07-08T00:16:46.013" LastActivityDate="2010-07-08T00:21:13.163" Title="What protoss unit can transport others?" Tags="<starcraft-2><how-to><protoss>" AnswerCount="3" CommentCount="2" />
<row Id="3" PostTypeId="1" AcceptedAnswerId="56" CreationDate="2010-07-07T19:09:46.317" Score="7" ViewCount="356" Body="<p>Steam won't let me have two instances running with the same user logged in.</p>
<p>Does that mean I cannot run a dedicated server on a PC (for example, for Left 4 Dead 2) <em>and</em> play from another machine?</p>
<p>Is there a way to run the dedicated server without running steam? Is there a configuration option I'm missing?</p>
" OwnerUserId="14" LastActivityDate="2010-07-07T19:27:04.777" Title="How can I run a dedicated server from steam?" Tags="<steam><left-4-dead-2><dedicated-server><account>" AnswerCount="1" />
<row Id="4" PostTypeId="1" AcceptedAnswerId="14" CreationDate="2010-07-07T19:11:05.640" Score="10" ViewCount="201" Body="<p>When I get to the insult sword-fighting stage of The Secret of Monkey Island, do I have to learn every single insult and comeback in order to beat the Sword Master?</p>
" OwnerUserId="17" LastEditorUserId="17" LastEditorDisplayName="" LastEditDate="2010-07-08T21:25:04.787" LastActivityDate="2010-07-08T21:25:04.787" Title="Do I have to learn all of the insults and comebacks to be able to advance in The Secret of Monkey Island?" Tags="<monkey-island><adventure>" AnswerCount="3" CommentCount="2" />
I would like to parse this XML but only load certain attributes, namely Id, PostTypeId, AcceptedAnswerId and two other attributes. Is there a way in SAX to load only these attributes? If there is, how? I am pretty new to SAX, so some guidance would help.
Otherwise, loading the whole thing would just be slow, and some of the attributes won't be used anyway, so loading them is pointless.
One other question: would it be possible to jump to a particular row that has row Id X? If so, how do I do this?

"StartElement" Sax Event permits to process a single XML ELement.
In your Java code you must implement this method in your handler:
@Override
public void startElement(String uri, String localName,
        String qName, Attributes attributes)
        throws SAXException {
    // qName is checked because localName may be empty when the parser is not namespace-aware
    if ("row".equals(qName)) {
        // this code is executed for every XML element "row"
        String id = attributes.getValue("Id"); // attribute names are case-sensitive
        String postTypeId = attributes.getValue("PostTypeId");
        String acceptedAnswerId = attributes.getValue("AcceptedAnswerId");
        // read the other two attributes the same way
        // you now have the attribute values for one "row" element
    }
}
For every element, you can access:
Namespace URI
XML QName
XML LocalName
Map of attributes, from which you can extract your other two attributes.
See the ContentHandler interface documentation for specific details.
UPDATED: improved the previous snippet.
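Putting the snippet above into a complete program, a minimal sketch could look like the following. The names PostsSaxDemo, Post and RowHandler are hypothetical, and the sketch assumes the default (not namespace-aware) parser from SAXParserFactory, which is why it matches on qName.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PostsSaxDemo {

    /** Hypothetical value holder for the attributes of interest. */
    static final class Post {
        final String id;
        final String postTypeId;
        final String acceptedAnswerId; // null when the attribute is absent

        Post(String id, String postTypeId, String acceptedAnswerId) {
            this.id = id;
            this.postTypeId = postTypeId;
            this.acceptedAnswerId = acceptedAnswerId;
        }
    }

    /** Keeps only the selected attributes of each "row" element. */
    static final class RowHandler extends DefaultHandler {
        final List<Post> posts = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("row".equals(qName)) {
                posts.add(new Post(
                        attributes.getValue("Id"),
                        attributes.getValue("PostTypeId"),
                        attributes.getValue("AcceptedAnswerId")));
            }
        }
    }

    static List<Post> parse(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RowHandler handler = new RowHandler();
        parser.parse(in, handler);
        return handler.posts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<posts>"
                + "<row Id=\"1\" PostTypeId=\"1\" AcceptedAnswerId=\"26\" Body=\"ignored\"/>"
                + "<row Id=\"3\" PostTypeId=\"2\" Body=\"ignored\"/>"
                + "</posts>";
        for (Post p : parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(p.id + " " + p.postTypeId + " " + p.acceptedAnswerId);
        }
    }
}
```

The parser still has to read every attribute of every row; the handler simply does not keep the ones it ignores. For the full dump you would probably process each row as it arrives (for example, write it to a database) instead of collecting everything in a list.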

It is pretty much the same approach as I've answered here already.
Scroll down to the org.xml.sax Implementation part. You'll only need a custom handler.

Yes, you can override methods that process only the elements you want:
http://www.javacommerce.com/displaypage.jsp?name=saxparser1.sql&id=18232
http://www.java2s.com/Code/Java/XML/SAXDemo.htm

SAX doesn't "load" elements. It informs your application of the start and end of each element, and it's entirely up to your application to decide which elements it takes any notice of.
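On the question of jumping to a row with Id X: SAX reads the document strictly front to back, so there is no way to seek to a row, but the application can stop listening as soon as it has seen the row it wants. A minimal sketch, assuming it is acceptable to abort the parse by throwing a SAXException from the handler (FindRowDemo and FindRowHandler are hypothetical names):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FindRowDemo {

    /** Scans "row" elements until the wanted Id shows up, then aborts the parse. */
    static final class FindRowHandler extends DefaultHandler {
        private final String wantedId;
        boolean found;
        String title;   // example attribute captured from the matching row
        int rowsSeen;   // how many rows were visited before stopping

        FindRowHandler(String wantedId) {
            this.wantedId = wantedId;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) throws SAXException {
            if (!"row".equals(qName)) {
                return;
            }
            rowsSeen++;
            if (wantedId.equals(attributes.getValue("Id"))) {
                found = true;
                title = attributes.getValue("Title");
                // Throwing from a callback is the usual way to end a SAX parse early.
                throw new SAXException("stop: row found");
            }
        }
    }

    static FindRowHandler find(InputSource source, String id) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        FindRowHandler handler = new FindRowHandler(id);
        try {
            parser.parse(source, handler);
        } catch (SAXException e) {
            if (!handler.found) {
                throw e; // a real parse error, not our early exit
            }
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<posts><row Id=\"1\" Title=\"A\"/><row Id=\"2\" Title=\"B\"/>"
                + "<row Id=\"3\" Title=\"C\"/></posts>";
        FindRowHandler h = find(new InputSource(new StringReader(xml)), "2");
        System.out.println(h.found + " " + h.title + " after " + h.rowsSeen + " rows");
    }
}
```

Every row before the match is still parsed, so a lookup near the end of a 10-million-row file costs almost a full pass. If such lookups are frequent, loading the selected attributes into a database or an index once is likely the better design.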

Related

How to Parse this non organised xml from DOM

How do I parse this description node?
I need the img link and the description text. Also, do you think this is correct XML for parsing? To me it does not make any sense.
<description><![CDATA[<img width="680" height="538"
src="https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=680" class="attachment-large size-large wp-post-image" alt=""
srcset="https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=680 680w,
https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=1360 1360w,
https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=150 150w,
https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=300 300w,
https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=768 768w"
sizes="(max-width: 680px) 100vw, 680px" />
DogBuddy, a pan-European online marketplace for dog sitting, has closed €5 million in Series A funding, money it plans for further expansion. Backing the London-headquartered startup in this round is existing investor Sweet Capital, the investment fund started by the founders of King.com, and a number of new unnamed private investors. It brings total raised by DogBuddy to €10 million.
Read More]]></description>
I'm using DOM
First of all:
and do you think its a correct XML for parsing because for me its not
making any sense,
Of course it is. You can test it here: https://codebeautify.org/xmlviewer
Second:
If this is the final response, then you have the choice of using the
android.text.Html; // for html tags
android.text.Html.ImageGetter; // for html image tags parsing
classes together with a regex. Hopefully that will work.
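As a rough illustration of the regex route, the sketch below reads the description with DOM (getTextContent returns the CDATA content as plain text) and then pulls out the first img src and the tag-free text. It uses only the JDK rather than the Android Html classes named above; DescriptionDemo and extract are hypothetical names, and a regex like this is only a sketch that assumes double-quoted attributes.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DescriptionDemo {

    // Matches the src attribute of an <img> tag; "srcset" does not match because of "src\s*=".
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*?\\ssrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns { image URL or null, description text with tags removed }. */
    static String[] extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // The CDATA section is delivered as ordinary text content.
        String html = doc.getElementsByTagName("description").item(0).getTextContent();

        Matcher m = IMG_SRC.matcher(html);
        String src = m.find() ? m.group(1) : null;

        // Crude tag stripping, then whitespace normalisation.
        String text = html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
        return new String[] { src, text };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><description><![CDATA[<img width=\"680\"\n"
                + "src=\"https://example.com/team1.png?w=680\" "
                + "srcset=\"https://example.com/team1.png?w=680 680w\" />\n"
                + "Some description text. Read More]]></description></item>";
        String[] result = extract(xml);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```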

Applying XSL Stylesheet with Java, output is not indented correctly

Just a heads up, I'm fairly new to Java, XML, and XSL, so bear with me. :)
I'm working on a uni assignment which has asked me to combine two XML files using Java.
While something like this would be much more straight-forward using XSL exclusively, my task was to use Java to combine said files and sort parts of the XML document in order of date attributes.
So; I have the combining working correctly but I'm trying to format my output and sort the parts in question.
In my Java code I have had to remove certain elements from the final (combined) output XML file; and I have been using x.getParentNode().removeChild(x) to do so.
This is leaving me with blank text nodes everywhere, so I have the following XSL to re-format at the final stage:
<!-- Code removed due to possible plagiarism -->
Now this is the weird part. If I apply this XSL to the output document that my Java code produces manually, it works as expected and gives me the perfect result.
However, if I try to apply the XSL with Java, it does strip the whitespace out, but it doesn't give the output document the correct indentation. The code below shows what I mean (the output XML file has been shortened for clarity):
Correct output when applied manually:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE authors
SYSTEM "output.dtd">
<authors>
<author>
<name>AMADEO, Giovanni Antonio</name>
<born-died>b. ca. 1447, Pavia, d. 1522, Milano</born-died>
<nationality>Italian</nationality>
<biography>Giovanni Antonio was....</biography>
<artworks form="architecture">
<artwork date="1473">
<title>Façade of the church</title>
<technique>Marble</technique>
<location>Certosa, Pavia</location>
</artwork>
</artworks>
</author>
</authors>
Output when applied with Java (using the Transformer class):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE authors SYSTEM "output.dtd">
<authors>
<author>
<name>AMADEO, Giovanni Antonio</name>
<born-died>b. ca. 1447, Pavia, d. 1522, Milano</born-died>
<nationality>Italian</nationality>
<biography>Giovanni Antonio Amadeo was an Italian early Renaissance sculptor, architect, and engineer. In 1466 he was engaged as a sculptor, with his brother Protasio, at the famous Certosa, near Pavia. He was a follower of the style of Bramantino of Milan, and he represents, like him, the Lombard direction of the Renaissance. He practised cutting deeply into marble, arranging draperies in cartaceous folds, and treating surfaces flatly even when he sculptured figures in high relief. Excepting in these technical points he differed from his associates completely, and so far surpassed them that he may be ranked with the great Tuscan artists of his time, which can be said of hardly any other North-Italian sculptor.</biography>
<artworks form="architecture">
<artwork date="1473">
<title>Façade of the church</title>
<technique>Marble</technique>
<location>Certosa, Pavia</location>
</artwork>
</artworks>
</author>
Notice how there is absolutely no indentation?
Here is my Java code that is saving the output & applying the XSL:
// Code removed due to possible plagiarism.
Note:
srcDoc1 is the XML document that I have combined (the program basically pulled stuff out of srcDoc2 and placed it in srcDoc1).
I've been bashing my head on my keyboard for the last couple days and really need some advice as to why this is behaving like this.
Thanks in advance!
We determined in the comment thread that you were loading Saxon in your Eclipse project, but Xalan in your standalone Java application.
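A quick way to confirm which XSLT implementation is actually in use is to print the class of the TransformerFactory that JAXP picks up. The sketch below does that and also requests indentation explicitly. IndentDemo is a hypothetical name, and the indent-amount property is an implementation-specific key understood by Xalan and the JDK's built-in transformer; treating it as honoured elsewhere is an assumption.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class IndentDemo {

    /** Name of the TransformerFactory implementation found on the classpath. */
    static String transformerImpl() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    /** Identity transform that asks the serializer to indent its output. */
    static String prettyPrint(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        // Implementation-specific key; processors that do not know it may ignore it.
        t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Transformer in use: " + transformerImpl());
        System.out.println(prettyPrint(
                "<authors><author><name>AMADEO, Giovanni Antonio</name></author></authors>"));
    }
}
```

Running this both inside Eclipse and from the standalone application should show whether the two environments resolve to different processors.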

JAXB unmarshalling - element that appears multiple times separated by other elements

I am trying to use JAXB to unmarshal an XML file that has an element that occurs 5 times, but not in a row; I want to make some changes, then marshal it back to XML. When written back to the XML file, the instances of the element need to go back in the same order, and be separated by the same intervening elements as before.
I know I can represent an element that occurs multiple times with a Collection, and I can specify the order of fields using @XmlType( propOrder = { ... } ), but I can't figure out how to do both at the same time...
I tried using 5 different field names in my Java class (encryptedData1, encryptedData2, ...), and 5 different pairs of getters/setters, and then annotating the setters with the same name:
@XmlElement( name = "EncryptedData" )
but when I unmarshal, only the first one gets set; the others are all null. The field that does get filled has the value of the last instance in the XML file, so I'm guessing it's just getting set five times.
If I use a List, then when I write out to the XML file, they all get written together.
Here is a sample of the original XML; the EncryptedData element is the one in question:
<NodeInformation>
...
<ProxyIpPort>1194</ProxyIpPort>
<LicenseKeyID />
<EncryptedData Type="http://www.w3.org/2001/04/xmlenc#Element" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#tripledes-cbc" />
<CipherData>
<CipherValue>************************</CipherValue>
</CipherData>
</EncryptedData>
<ProxyUsername />
<EncryptedData Type="http://www.w3.org/2001/04/xmlenc#Element" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#tripledes-cbc" />
<CipherData>
<CipherValue>***********************</CipherValue>
</CipherData>
</EncryptedData>
<ActualIpAddress />
<HasMASServer>false</HasMASServer>
<MASServerIpAddress />
<MASServerWebListeningPort>443</MASServerWebListeningPort>
<ModemNumber />
<RememberLoginPassword>true</RememberLoginPassword>
<LoginName>admin</LoginName>
<EncryptedData Type="http://www.w3.org/2001/04/xmlenc#Element" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#tripledes-cbc" />
<CipherData>
<CipherValue>***************************</CipherValue>
</CipherData>
</EncryptedData>
...
</NodeInformation>
Thank you in advance for any insight
Do you only want to change elements other than EncryptedData, i.e. keep the unchanged EncryptedData elements in their relative positions?
If this is the case, it might be possible using the JAXB Binder;
see JAXB & XML Infoset Preservation
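The idea behind that suggestion is that the original tree stays the source of truth and only the changed nodes are touched. The sketch below is not the JAXB Binder API; it is a plain DOM illustration of the same principle, assuming only one ordinary element (LoginName here) needs a new value. PreserveOrderDemo and changeLoginName are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PreserveOrderDemo {

    /** Edits one element in place; every other node keeps its original position. */
    static String changeLoginName(String xml, String newName) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // the EncryptedData elements carry their own namespace
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        NodeList nodes = doc.getElementsByTagName("LoginName");
        if (nodes.getLength() > 0) {
            nodes.item(0).setTextContent(newName);
        }

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String ns = "http://www.w3.org/2001/04/xmlenc#";
        String xml = "<NodeInformation>"
                + "<EncryptedData xmlns=\"" + ns + "\">first</EncryptedData>"
                + "<LoginName>admin</LoginName>"
                + "<EncryptedData xmlns=\"" + ns + "\">second</EncryptedData>"
                + "</NodeInformation>";
        System.out.println(changeLoginName(xml, "root"));
    }
}
```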

JAXB Multiple Content for Element

I'm currently working on a txt-to-xml project. Basically what I'm doing is creating different XmlElements for some of the content.
I got a DTD up and running and for now I'm creating a default xml, just to make sure every xml created is a valid xml (for the DTD given).
I'm mainly creating new Classes for every Element, which doesn't have a #PCDATA structure and it's working pretty fine so far.
Now I'm struggling with a problem:
I got the following in my DTD:
<!ELEMENT REACTION (#PCDATA | ACTOR)*>
What I'm looking for in my Text is something like:
Prof. X clapped!
and I want to extract this into my XML as:
<REACTION>
<ACTOR>Prof. X</ACTOR> clapped!
</REACTION>
So what I basically want is a String attribute within the ReactionClass which is declared as an XML element but holds an Actor attribute plus the rest of the text. I thought of something like:
String m_sText;
String m_sActor;
public ReactionClass(){
    this.m_sActor = "Prof. X";
    this.m_sText = this.m_sActor + " clapped!";
}
@XmlElement(name = "TEXT")
public String getM_sText(){ return this.m_sText; }
@XmlElement(name = "ACTOR")
public String getM_sActor(){ return this.m_sActor; }
For all other Nodes, such as the RootNode I created a RootNodeClass which holds different attributes, such as m_nLocation, m_nTime, m_nYear which are declared as XML-Elements, so the JAXB-Marshaller just builds up the XML on basis of these elements:
<ROOT>
<TIME>09:00</TIME>
<LOCATION>New York</LOCATION>
<YEAR>1992</YEAR>
</ROOT>
I wanted to do the same with the REACTION node (as mentioned above), but when creating a new class REACTION I'm getting something like:
<REACTION>
<TEXT>Prof. X clapped!</TEXT>
<ACTOR>Prof. X</ACTOR>
</REACTION>
How would I put them into one Element but still keep the Tags such as above?
If anybody got an idea how to manage this I would be very thankful!
Thanks Max
First, what you most probably need is @XmlMixed. You'll probably have a structure like:
@XmlMixed
@XmlElementRefs({
@XmlElementRef(name="ACTOR", type=JAXBElement.class),
...})
List<Object> content;
With this you can put both Strings and JAXBElement<Actor> instances into the list to achieve so-called mixed content.
Next, you might consider turning your DTD into XML Schema first and compiling it - or compiling the DTD with XJC.
Finally, what you have is so-called "semi-structured data" which I think is not quite suitable for JAXB. JAXB works great for strong and clear structures, but if you have mixed stuff you get weird models that are hard to work with. I can't suggest an alternative though.
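For comparison, here is what the desired mixed content looks like when built directly with the JDK's DOM API instead of JAXB. This is only one possible alternative and not something the answer above proposes; MixedContentDemo and buildReaction are hypothetical names.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class MixedContentDemo {

    /** Builds <REACTION><ACTOR>actor</ACTOR>rest</REACTION> as a string. */
    static String buildReaction(String actor, String rest) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        Element reaction = doc.createElement("REACTION");
        doc.appendChild(reaction);

        // Mixed content is simply an element child followed by a text node child.
        Element actorElement = doc.createElement("ACTOR");
        actorElement.setTextContent(actor);
        reaction.appendChild(actorElement);
        reaction.appendChild(doc.createTextNode(rest));

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildReaction("Prof. X", " clapped!"));
    }
}
```

The child order of the REACTION element (element first, text second) is exactly what the List<Object> content field models on the JAXB side.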

Android XML parsing using SAX Parser

I have been trying to parse this ( http://app.calvaryccm.com/mobile/android/v1/devos) URL using a SAX parser found here: http://android-er.blogspot.com/2010/05/simple-rss-reader-iii-show-details-once.html I have been working on how to handle the description tag within the XML. I have tried this with and without the CDATA tag and nothing seems to help. It's almost as if the link is being read into the description.
The first part works just fine:
The problem happens when I try to access the inner page. It's almost as if the link tag is getting read before the description tag is.
I am having an issue in getting the description tag to display right. Thank you for your help!
EDIT the full source code for this application is here: http://dl.dropbox.com/u/19136502/CCM.zip
Ouch, after about 3 hours digging and analyzing your source code, I've found the reason why you have such a weird result like above.
First look at the RSS content from the link you parse: http://app.calvaryccm.com/mobile/android/v1/devos
Some parts of its content:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>CCM Daily Devotions</title>
<link>http://www.calvaryccm.com/resources/dailydevotions.aspx</link>
<description>Calvary Chapel Melbourne's Daily Devotionals</description>
<webMaster>webmaster@calvaryccm.com (Calvary Chapel Melbourne)</webMaster>
<copyright>(c)2011, Calvary Chapel Melbourne. All rights reserved</copyright>
<ttl>60</ttl>
<item>
<guid isPermaLink="false">b3e91cbf-bbe9-4667-bf4c-8ff831ba09f1</guid>
<title>Teachable Moments</title>
<description>Based on &ldquo;Role Models, Part 4&rdquo; by Pastor Mark Balmer; 10/8-9/11,
Message #6078; Daily Devotional #6 - &ldquo;Teachable Moments&rdquo; Preparing the Soil (Introduction): My husband and I took seriously our understanding of God&rsquo;s instructions to teach His commandments to our children. (Deuteronomy 6:7) We went to our local Christian bookstore and bought children&rsquo;s Bibles, studies, coloring books, games&mdash;anything that would help us to communicate biblical situations in their lives. Planting and Watering the Seed (Growth): Each parent needs to take seriously God&rsquo;s commthe Crop (Action/Response): Life is God&rsquo;s classroom for teachable moments. A long delay in traffic can be a frustrating irritation, or it can be an opportunity to teach our children that God&rsquo;s than taught. Cultivating (Additional Reading): Psalm 78:1-8;&nbsp;Psalm 145:4
klw Calvary Chapel of Melbourne; 2955 Minton Road; W. Melbourne, FL 32904; 321-952-9673
NLT = New Living Translation. </description> <link>http://www.calvaryccm.com/resources/dailydevotions.aspx</link> <pubDate>Sun, 16 Oct 2011 12:00:00 GMT</pubDate> </item>
Pay close attention to the element /rss/channel/item/description. What you can see there are things like &rsquo;, &lsquo;, &amp;, &ldquo;, &rdquo; ... Those are escaped characters (right single quote, left single quote, ampersand, left double quote, right double quote, and there are even newlines) residing in the XML content.
So when the XML parser walks through these characters, it treats them as escape sequences, which leads to the weird result you are facing right now.
What about a solution? At first, I can think of fetching the content of the URL, unescaping those characters, and then parsing the result again, which I think should succeed.
This solution seems workable; however, I am not sure it will hold up, because the RSS text returned by the server is in a really odd format (not well formatted). So if you can contact the web administrator, ask them to format the RSS content properly (escape special characters consistently, remove all newline characters...) before publishing the feed.
The other solution is to use a third-party library that handles escaping and unescaping, like StringEscapeUtils from Apache Commons: http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringEscapeUtils.html or JTidy.
But I don't think these libraries are the best fit for your case.
That's all I can tell.
P.S.: just some comments on your source code: I think you should make your code clearer to read, easier to maintain, and repackage it appropriately.
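One related detail that is worth checking in any SAX handler, whether or not it is the cause here: a parser is allowed to deliver the text of a single element in several characters() calls, and it commonly splits at entity references such as &amp;. A handler that keeps only the most recent chunk ends up with truncated text. A minimal sketch that buffers the chunks instead (DescriptionBufferDemo and ItemHandler are hypothetical names):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DescriptionBufferDemo {

    /** Collects the full text of every description element. */
    static final class ItemHandler extends DefaultHandler {
        final List<String> descriptions = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inDescription;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("description".equals(qName)) {
                inDescription = true;
                text.setLength(0); // start a fresh buffer for this element
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inDescription) {
                text.append(ch, start, length); // append, never overwrite
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("description".equals(qName)) {
                descriptions.add(text.toString());
                inDescription = false;
            }
        }
    }

    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ItemHandler handler = new ItemHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.descriptions;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss><channel><item>"
                + "<description>God&amp;rsquo;s classroom &amp; more</description>"
                + "<link>http://www.calvaryccm.com/resources/dailydevotions.aspx</link>"
                + "</item></channel></rss>";
        System.out.println(parse(xml));
    }
}
```

After parsing, the description still contains HTML entity names such as &rsquo; as literal text, so a second unescaping step (for example with the StringEscapeUtils class mentioned above) would be needed before display.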
