Android XML parsing using SAX Parser - java

I have been trying to parse this URL (http://app.calvaryccm.com/mobile/android/v1/devos) using the SAX parser found here: http://android-er.blogspot.com/2010/05/simple-rss-reader-iii-show-details-once.html. I have been working on how to handle the description tag within the XML. I have tried this with and without the CDATA tag, and nothing seems to help. It's almost as if the link is being read into the description.
The first part works just fine:
The problem happens when I try to access the inner page. It's almost as if the link tag is being read before the description tag.
I am having trouble getting the description tag to display correctly. Thank you for your help!
EDIT: The full source code for this application is here: http://dl.dropbox.com/u/19136502/CCM.zip

Ouch, after about 3 hours of digging through and analyzing your source code, I've found the reason for the weird result described above.
First, look at the RSS content from the link you parse: http://app.calvaryccm.com/mobile/android/v1/devos
Part of its content:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>CCM Daily Devotions</title>
<link>http://www.calvaryccm.com/resources/dailydevotions.aspx</link>
<description>Calvary Chapel Melbourne's Daily Devotionals</description>
<webMaster>webmaster@calvaryccm.com (Calvary Chapel Melbourne)</webMaster>
<copyright>(c)2011, Calvary Chapel Melbourne. All rights reserved</copyright>
<ttl>60</ttl>
<item>
<guid isPermaLink="false">b3e91cbf-bbe9-4667-bf4c-8ff831ba09f1</guid>
<title>Teachable Moments</title>
<description>Based on &ldquo;Role Models, Part 4&rdquo; by Pastor Mark Balmer; 10/8-9/11,
Message #6078; Daily Devotional #6 - &ldquo;Teachable Moments&rdquo; Preparing the Soil (Introduction): My husband and I took seriously our understanding of God&rsquo;s instructions to teach His commandments to our children. (Deuteronomy 6:7) We went to our local Christian bookstore and bought children&rsquo;s Bibles, studies, coloring books, games&mdash;anything that would help us to communicate biblical situations in their lives. Planting and Watering the Seed (Growth): Each parent needs to take seriously God&rsquo;s commthe Crop (Action/Response): Life is God&rsquo;s classroom for teachable moments. A long delay in traffic can be a frustrating irritation, or it can be an opportunity to teach our children that God&rsquo;s than taught. Cultivating (Additional Reading): Psalm 78:1-8;&nbsp;Psalm 145:4
klw Calvary Chapel of Melbourne; 2955 Minton Road; W. Melbourne, FL 32904; 321-952-9673
NLT = New Living Translation. </description> <link>http://www.calvaryccm.com/resources/dailydevotions.aspx</link> <pubDate>Sun, 16 Oct 2011 12:00:00 GMT</pubDate> </item>
Look closely at the /rss/channel/item/description element. It contains sequences such as &rsquo;, &lsquo;, &amp;, &ldquo; and &rdquo;. These are entity references (right single quote, left single quote, ampersand, left double quote, right double quote, ...), and the text also contains raw newlines. XML itself predefines only five entities (&amp;, &lt;, &gt;, &apos; and &quot;); the others are HTML entities.
So when the XML parser reaches one of these references, it does not treat it as ordinary text, which leads to the weird result you are seeing right now.
What about a solution? The first thing I can think of is to fetch the content of the URL, unescape those characters (replace each entity with the character it stands for), and then parse the result; that should succeed.
This solution seems to work, but I think it might be fragile, because the RSS text returned by the server is not well formed. So if you can contact the site's administrator, ask them to format the RSS content properly (escape those characters correctly, remove the stray newline characters, ...) before publishing the feed.
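The "fetch, unescape, then parse" idea can be sketched roughly as follows. This is only a minimal sketch, not the code from the linked project: the class name EntityPreprocessor, the helper names, and the (deliberately incomplete) entity table are assumptions made for illustration. It rewrites a few HTML entities as numeric character references, which any XML parser accepts, and collects the description text in a StringBuilder because SAX may deliver one text run through several characters() calls.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntityPreprocessor {

    // Illustrative, incomplete table: HTML entity -> numeric character reference.
    private static final Map<String, String> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("&lsquo;", "&#8216;");
        HTML_ENTITIES.put("&rsquo;", "&#8217;");
        HTML_ENTITIES.put("&ldquo;", "&#8220;");
        HTML_ENTITIES.put("&rdquo;", "&#8221;");
        HTML_ENTITIES.put("&mdash;", "&#8212;");
        HTML_ENTITIES.put("&nbsp;", "&#160;");
    }

    // Replaces the listed HTML entities; XML's own entities (&amp; etc.) are left alone.
    static String replaceHtmlEntities(String xml) {
        String result = xml;
        for (Map.Entry<String, String> entry : HTML_ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // Returns the concatenated text of every <description> element in the input.
    static String descriptionText(String rawXml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inDescription;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("description".equals(qName)) {
                    inDescription = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("description".equals(qName)) {
                    inDescription = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so append rather than assign.
                if (inDescription) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(replaceHtmlEntities(rawXml))), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed", e);
        }
        return text.toString();
    }
}
```

A real feed would need the full HTML entity list (or a library), so treat the table above as a placeholder.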
Another solution is to use a third-party library that handles escaping and unescaping, such as StringEscapeUtils from Apache Commons (http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringEscapeUtils.html) or JTidy.
But I don't think these libraries are the best fit for your case.
That's all I can tell.
P.S.: A comment on your source code: I think you should make it clearer to read, easier to maintain, and packaged more appropriately.

Related

How to Parse this non organised xml from DOM

How do I parse this description node?
I need the img link and the description. Also, do you think this is correct XML for parsing? To me it doesn't make any sense.
<description><![CDATA[<img width="680" height="538"
src="https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=680" class="attachment-large size-large wp-post-image" alt=""
srcset="https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=680 680w,
https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=1360 1360w,
https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=150 150w,
https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=300 300w,
https://tctechcrunch2011.files.wordpress.com/2017/09/team1.png?w=768 768w"
sizes="(max-width: 680px) 100vw, 680px" />
DogBuddy, a pan-European online marketplace for dog sitting, has closed €5 million in Series A funding, money it plans for further expansion. Backing the London-headquartered startup in this round is existing investor Sweet Capital, the investment fund started by the founders of King.com, and a number of new unnamed private investors. It brings total raised by DogBuddy to €10 million.
Read More]]></description>
I'm using DOM
First of all:
"and do you think its a correct XML for parsing because for me its not making any sense,"
Of course it is. You can test it here: https://codebeautify.org/xmlviewer
Second, if this is the final response, then you have the choice of using the
android.text.Html; // for html tags
android.text.Html.ImageGetter; // for html image tags parsing
classes and regex. Hopefully that will work.
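The regex route mentioned above could look something like the sketch below. It uses only the JDK's DOM parser and java.util.regex (not android.text.Html), and the class name DescriptionExtractor, the method name, and both patterns are assumptions for illustration. It reads the CDATA content of the first description element, pulls out the first img src, and strips the remaining tags to leave plain text.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DescriptionExtractor {

    // Matches the src attribute of an <img> tag; "srcset" is not matched because "=" must follow "src".
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Crude tag stripper; good enough for a sketch, not a full HTML parser.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    // Returns { image URL or null, plain text } for the first <description> element.
    static String[] imageAndText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent() returns the CDATA section's content as a plain string.
            String html = doc.getElementsByTagName("description").item(0).getTextContent();
            Matcher matcher = IMG_SRC.matcher(html);
            String src = matcher.find() ? matcher.group(1) : null;
            String text = TAGS.matcher(html).replaceAll("").trim();
            return new String[] { src, text };
        } catch (Exception e) {
            throw new IllegalStateException("could not parse description", e);
        }
    }
}
```

On Android, Html.fromHtml with an ImageGetter would be the more natural choice for rendering; the regex version is only meant to show that the CDATA payload is ordinary text once DOM has handed it over.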

Applying XSL Stylesheet with Java, output is not indented correctly

Just a heads up, I'm fairly new to Java, XML, and XSL, so bear with me. :)
I'm working on a uni assignment which has asked me to combine two XML files using Java.
While something like this would be much more straight-forward using XSL exclusively, my task was to use Java to combine said files and sort parts of the XML document in order of date attributes.
So; I have the combining working correctly but I'm trying to format my output and sort the parts in question.
In my Java code I have had to remove certain elements from the final (combined) output XML file; and I have been using x.getParentNode().removeChild(x) to do so.
This is leaving me with blank text nodes everywhere, so I have the following XSL to re-format at the final stage:
<!-- Code removed due to possible plagiarism -->
Now this is the weird part. If I apply this XSL to the output document that my Java code produces manually, it works as expected and gives me the perfect result.
However, if I try to apply the XSL with Java, it does strip the whitespace out, but it doesn't give the output document the correct indentation. The code below shows what I mean (the output XML file has been shortened for clarity):
Correct output when applied manually:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE authors
  SYSTEM "output.dtd">
<authors>
   <author>
      <name>AMADEO, Giovanni Antonio</name>
      <born-died>b. ca. 1447, Pavia, d. 1522, Milano</born-died>
      <nationality>Italian</nationality>
      <biography>Giovanni Antonio was....</biography>
      <artworks form="architecture">
         <artwork date="1473">
            <title>Façade of the church</title>
            <technique>Marble</technique>
            <location>Certosa, Pavia</location>
         </artwork>
      </artworks>
   </author>
</authors>
Output when applied with Java (using the Transformer class):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE authors SYSTEM "output.dtd">
<authors>
<author>
<name>AMADEO, Giovanni Antonio</name>
<born-died>b. ca. 1447, Pavia, d. 1522, Milano</born-died>
<nationality>Italian</nationality>
<biography>Giovanni Antonio Amadeo was an Italian early Renaissance sculptor, architect, and engineer. In 1466 he was engaged as a sculptor, with his brother Protasio, at the famous Certosa, near Pavia. He was a follower of the style of Bramantino of Milan, and he represents, like him, the Lombard direction of the Renaissance. He practised cutting deeply into marble, arranging draperies in cartaceous folds, and treating surfaces flatly even when he sculptured figures in high relief. Excepting in these technical points he differed from his associates completely, and so far surpassed them that he may be ranked with the great Tuscan artists of his time, which can be said of hardly any other North-Italian sculptor.</biography>
<artworks form="architecture">
<artwork date="1473">
<title>Façade of the church</title>
<technique>Marble</technique>
<location>Certosa, Pavia</location>
</artwork>
</artworks>
</author>
Notice how there is absolutely no indentation?
Here is my Java code that is saving the output & applying the XSL:
// Code removed due to possible plagiarism.
Note:
srcDoc1 is the XML document that I have combined (the program basically pulled stuff out of srcDoc2 and placed it in srcDoc1).
I've been bashing my head on my keyboard for the last couple of days and really need some advice as to why this is behaving like this.
Thanks in advance!
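We determined in the comment thread that you were loading Saxon in your Eclipse project, but Xalan in your standalone Java application. The two runs therefore used different XSLT processors, which explains the difference in indentation.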
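One way to check for this kind of mismatch is to print which TransformerFactory implementation JAXP actually picked up in each environment. The sketch below is an assumption-laden illustration, not the removed assignment code: the class name IndentCheck and its methods are made up, and the "{http://xml.apache.org/xslt}indent-amount" output property is a Xalan-specific extension that other processors are free to ignore.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IndentCheck {

    // Name of the XSLT implementation found on the classpath (Saxon, Xalan, the JDK's built-in one, ...).
    static String factoryName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    // Re-serializes the XML with an identity transform and asks for indented output.
    static String indent(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            // Xalan-family extension; namespaced properties that a processor does not know are ignored.
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transform failed", e);
        }
    }
}
```

Calling System.out.println(IndentCheck.factoryName()) in both the Eclipse project and the standalone application should show whether they resolve to the same processor.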

Unable to parse a few XML nodes. What's the protection applied?

I have an XML feed like this:
<item><title>Left hopes BJP surge will eat into Mamata’s votes </title><link>http://timesofindia.feedsportal.com/c/33039/f/533916/s/39439a29/sc/7/l/0Ltimesofindia0Bindiatimes0N0Cindia0CLeft0Ehopes0EBJP0Esurge0Ewill0Eeat0Einto0EMamatas0Evotes0Chome0Clok0Csabha0Celections0C20A140Cnews0CLeft0Ehopes0EBJP0Esurge0Ewill0Eeat0Einto0EMamatas0Evotes0Carticleshow0C336252890Bcms/story01.htm</link><description>At times sworn enemies can be of help to each other, albeit indirectly. In the current political winds of West Bengal, no one knows it better than the Left.<img width='1' height='1' src='http://timesofindia.feedsportal.com/c/33039/f/533916/s/39439a29/sc/7/mf.gif' border='0'/><br clear='all'/><br/><br/><a href="http://da.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/rc/1/rc.htm" rel="nofollow"><img src="http://da.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/rc/1/rc.img" border="0"/></a><br/><a href="http://da.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/rc/2/rc.htm" rel="nofollow"><img src="http://da.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/rc/2/rc.img" border="0"/></a><br/><a href="http://da.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/rc/3/rc.htm" rel="nofollow"><img src="http://da.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/rc/3/rc.img" border="0"/></a><br/><br/><a href="http://da.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/a2.htm"><img src="http://da.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/a2.img" border="0"/></a><img width="1" height="1" src="http://pi.feedsportal.com/r/194480044196/u/409/f/533916/c/33039/s/39439a29/sc/7/a2t.img" border="0"/></description><pubDate>Fri, 11 Apr 2014 19:26:07 GMT</pubDate><guid 
isPermaLink="false">http://timesofindia.indiatimes.com/india/Left-hopes-BJP-surge-will-eat-into-Mamatas-votes/home/lok/sabha/elections/2014/news/Left-hopes-BJP-surge-will-eat-into-Mamatas-votes/articleshow/33625289.cms</guid></item>
I am using the Jaunt API to scrape the news title and link from this feed. Here is the code:
agent.visit("http://timesofindia.feedsportal.com/c/33039/f/533916/index.rss");
Elements items = agent.doc.findEach("<item>");
for (Element item : items)
{
    headline = item.findFirst("<title>").getText();
    link = item.findFirst("<link>").getText();
    System.out.println("headline:" + headline + "\nlink:" + link + "\n");
}
Now I am getting all the headlines, but null for the link. The same thing happened when I was scraping another newspaper's feed. Is there anything special (such as encoding) about that link node that causes the null, or am I doing something wrong?
I'm not sure, but it is possible that findFirst doesn't handle <link> because findFirst is more oriented towards comments. Would getFirst with an appropriate query be viable?
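If the Jaunt query cannot be made to work, one fallback is to read the feed with the JDK's own DOM parser, which treats link as an ordinary element. This is a swapped-in technique rather than a Jaunt fix, and the class name FeedLinks and its methods are hypothetical; it is a minimal sketch that assumes the feed is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FeedLinks {

    // Returns one { headline, link } pair per <item> element in the feed.
    static List<String[]> headlinesAndLinks(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(new String[] { childText(item, "title"), childText(item, "link") });
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed", e);
        }
    }

    // Text of the first descendant element with the given tag name, or null if there is none.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : null;
    }
}
```

The feed text itself would still have to be downloaded first (for example with java.net.URL), which is left out here.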

Parsing xml files on hadoop...

How does one parse an XML file on Hadoop with a structure like the following?
<row Id="2292" PostTypeId="2" ParentId="2284" CreationDate="2008-08-05T13:28:06.700" Score="0" ViewCount="0" Body="<p>The first thing you should do is contact the main people who run the open source project. Ask them if it is ok to contribute to the code and go from there.</p>
<p>Simply writing your improved code and then giving it to them may result in your code being rejected.</p>" OwnerUserId="383" LastActivityDate="2008-08-05T13:28:06.700" />
Note: I have written code for it, but it's not working correctly. I need a fresh approach...
Thanks in advance...
Take a look at XMLInputFormat; it may have to be modified a bit.

parsing large XML using SAX in java

I am trying to parse the Stack Overflow data dump. One of the tables is called posts.xml, which has around 10 million entries in it. Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="26" CreationDate="2010-07-07T19:06:25.043" Score="10" ViewCount="1192" Body="<p>Now that the Engineer update has come, there will be lots of Engineers building up everywhere. How should this best be handled?</p>
" OwnerUserId="11" LastEditorUserId="56" LastEditorDisplayName="" LastEditDate="2010-08-27T22:38:43.840" LastActivityDate="2010-08-27T22:38:43.840" Title="In Team Fortress 2, what is a good strategy to deal with lots of engineers turtling on the other team?" Tags="<strategy><team-fortress-2><tactics>" AnswerCount="5" CommentCount="7" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="184" CreationDate="2010-07-07T19:07:58.427" Score="5" ViewCount="469" Body="<p>I know I can create a Warp Gate and teleport to Pylons, but I have no idea how to make Warp Prisms or know if there's any other unit capable of transporting.</p>
<p>I would in particular like this to built remote bases in 1v1</p>
" OwnerUserId="10" LastEditorUserId="68" LastEditorDisplayName="" LastEditDate="2010-07-08T00:16:46.013" LastActivityDate="2010-07-08T00:21:13.163" Title="What protoss unit can transport others?" Tags="<starcraft-2><how-to><protoss>" AnswerCount="3" CommentCount="2" />
<row Id="3" PostTypeId="1" AcceptedAnswerId="56" CreationDate="2010-07-07T19:09:46.317" Score="7" ViewCount="356" Body="<p>Steam won't let me have two instances running with the same user logged in.</p>
<p>Does that mean I cannot run a dedicated server on a PC (for example, for Left 4 Dead 2) <em>and</em> play from another machine?</p>
<p>Is there a way to run the dedicated server without running steam? Is there a configuration option I'm missing?</p>
" OwnerUserId="14" LastActivityDate="2010-07-07T19:27:04.777" Title="How can I run a dedicated server from steam?" Tags="<steam><left-4-dead-2><dedicated-server><account>" AnswerCount="1" />
<row Id="4" PostTypeId="1" AcceptedAnswerId="14" CreationDate="2010-07-07T19:11:05.640" Score="10" ViewCount="201" Body="<p>When I get to the insult sword-fighting stage of The Secret of Monkey Island, do I have to learn every single insult and comeback in order to beat the Sword Master?</p>
" OwnerUserId="17" LastEditorUserId="17" LastEditorDisplayName="" LastEditDate="2010-07-08T21:25:04.787" LastActivityDate="2010-07-08T21:25:04.787" Title="Do I have to learn all of the insults and comebacks to be able to advance in The Secret of Monkey Island?" Tags="<monkey-island><adventure>" AnswerCount="3" CommentCount="2" />
I would like to parse this XML, but only load certain attributes: Id, PostTypeId, AcceptedAnswerId and two others. Is there a way in SAX to load only these attributes? If so, how? I am pretty new to SAX, so some guidance would help.
Otherwise, loading the whole thing would be slow, and some of the attributes won't be used anyway, so it's wasteful.
One other question: would it be possible to jump to a particular row that has row Id X? If so, how do I do this?
The SAX "startElement" event lets you process a single XML element.
In your Java code you must implement this method:
@Override
public void startElement(String uri, String localName,
        String qName, Attributes attributes)
        throws SAXException {
    // qName is checked because localName is empty unless the parser is namespace-aware
    if ("row".equals(qName)) {
        // this code is executed for every XML element "row"
        String id = attributes.getValue("Id"); // attribute names are case-sensitive
        String postTypeId = attributes.getValue("PostTypeId");
        String acceptedAnswerId = attributes.getValue("AcceptedAnswerId"); // null if absent
        // ...the other two attributes
        // you now have the attribute values for one "row" element
    }
}
For every element, you can access:
Namespace URI
XML QName
XML LocalName
Map of attributes; here you can extract your two other attributes...
See the ContentHandler implementation for specific details.
Bye
UPDATED: improved the previous snippet.
It is pretty much the same approach as I've answered here already.
Scroll down to the org.xml.sax Implementation part. You'll only need a custom handler.
Yes, you can override methods that process only the elements you want:
http://www.javacommerce.com/displaypage.jsp?name=saxparser1.sql&id=18232
http://www.java2s.com/Code/Java/XML/SAXDemo.htm
SAX doesn't "load" elements. It informs your application of the start and end of each element, and it's entirely up to your application to decide which elements it takes any notice of.
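Putting the answers together, a handler that keeps only a few attributes and stops at a given row could be sketched like this. SAX reads the file sequentially, so it cannot jump straight to row X; the usual workaround, assumed here, is to abort the parse by throwing an exception once that row has been seen. The class PostRowReader, the StopParsing exception and the method readRows are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PostRowReader {

    // Hypothetical marker exception used only to end the parse early.
    static class StopParsing extends SAXException {
        StopParsing() {
            super("target row reached");
        }
    }

    // Collects { Id, PostTypeId, AcceptedAnswerId } per row; stops after the row whose Id equals stopAtId.
    static List<String[]> readRows(String xml, final String stopAtId) {
        final List<String[]> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                if (!"row".equals(qName)) {
                    return; // every other element is simply ignored
                }
                String id = atts.getValue("Id");
                // getValue returns null for attributes the row does not have.
                rows.add(new String[] { id, atts.getValue("PostTypeId"), atts.getValue("AcceptedAnswerId") });
                if (stopAtId != null && stopAtId.equals(id)) {
                    throw new StopParsing();
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (StopParsing expected) {
            // Normal early exit: the requested row was found.
        } catch (Exception e) {
            throw new IllegalStateException("could not parse posts", e);
        }
        return rows;
    }
}
```

For the real posts.xml you would pass a File or InputStream to parse() instead of a String, and probably write each row to its destination inside startElement rather than keeping 10 million entries in a list.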
