Writing an RSS reader in Java

I'm trying to write a basic RSS reader for a class project. Our book shows an example that walks the DOM tree. Is that a decent approach for an RSS reader? Would I just ignore the tags that are of no interest to me and won't be used by the RSS reader? Thanks.

For inspiration you can look at ROME, an open source tool for handling RSS & Atom feeds.

It's one of two common approaches, so yes. And yes, ignoring tags that are not of interest is a good way to handle it. If you don't need it, no need to take note of it. If you know in advance exactly what tags you need, you probably don't need to walk the entire DOM tree.
You could also use a SAX parser which would probably be faster and less memory intensive, though probably not necessary in this case, depending on how many results you wish to have in the feed.
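As a rough illustration of the DOM approach, here is a minimal sketch that uses only the JDK's built-in parser. The class and method names (RssDomSketch, readItems, childText) are made up for this example, and it assumes a plain RSS 2.0 feed where each <item> has <title> and <link> children; every other tag is simply ignored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not taken from any library.
class RssDomSketch {

    /** Returns "title -> link" for every <item>; all other tags are ignored. */
    static List<String> readItems(String rssXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            List<String> result = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(childText(item, "title") + " -> " + childText(item, "link"));
            }
            return result;
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    /** Text of the first direct child element with the given name, or "" if absent. */
    static String childText(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals(name)) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }
}
```

Calling RssDomSketch.readItems(feedXml) gives one string per item. Only the direct children of each <item> are inspected, so the channel's own <title> is not picked up by accident.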

Well, the beauty of RSS feeds is that they always come in a standard structure, even though some feeds contain non-standard fields, like Google Picasa's RSS feed. The most straightforward approach, in my opinion, is to use a tool that lets you unmarshal the RSS XML feed into your own RSS bean. This way, you don't need to write much code, and you can pick the fields you want and ignore the ones you don't.
In my case, I use Castor to perform the unmarshalling process where I read the Google Picasa RSS feed and gather only the fields I want. Hope this helps.

Processing Atom Feeds with JAXB
You could also map your XML to objects using JAXB. You could then use these objects in your RSS reader.
http://bdoughan.blogspot.com/2010/09/processing-atom-feeds-with-jaxb.html
The JAXB reference implementation is included in Java SE 6, there are also other implementations such as MOXy (I'm the tech lead):
http://wiki.eclipse.org/EclipseLink/Examples/MOXy/GettingStarted
You only need to map the portions you are interested in.
Processing Atom Feeds with SDO
You could also use Service Data Objects (SDO) to do this:
http://bdoughan.blogspot.com/2010/09/processing-atom-feeds-with-sdo.html

I have both parsed and produced RSS with the JDOM library. It's been around a long time and is updated infrequently, but my experience is that it hasn't needed updating. You may want to look into it, but since it's quite powerful, you may find that it offers more functionality than you need.
http://jdom.org/

Related

Parsing very large XML files and marshalling to Java Objects

I have the following issue: I have very large XML files (300+ MB), and I need to parse them in order to add some of their values to the db. The structure of these files is also very complex. I want to use a StAX parser, as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, without loading the whole thing in memory. On the other hand, getting the values with StAX (at least on these XML files) is cumbersome, and I need to write a ton of code. From this latter point of view it would help me immensely if I could unmarshal the XML file to Java objects (like JAXB does); however, this would load the whole file plus a ton of object instances in memory all at once.
My question is: is there some way to pull-parse (or just partially parse) the file sequentially, and then unmarshal only those parts to Java objects, so I can deal with them easily without bogging down on memory?

I would recommend Eclipse EMF. But it has the same problem: if you give it the file name, it parses the whole thing. There are some options to reduce how much is loaded, but I didn't bother much with them as we run on machines with 96 GB RAM. :)
Anyway, If your XML format is well defined, then one workaround is to fool the EMF by breaking down the whole file into several smaller (but still well defined) XML snippets. Then feed each snippet one after the other. I don't know JAX-B, but perhaps the same workaround can be applied there as well. Which I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>
    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>
    ............
    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>
</tag1>
Then it can be broken down into one XML document per block, each starting with <tag2> and ending with </tag2>. Most Java parsers accept a stream, so read the file with whatever you like, wrap each <tag2> block in something like a StringReader or ByteArrayInputStream inside a loop, and pass it to JAXB or EMF.
HTH

Well, first off I want to thank the two people who answered my question. I finally ended up not using those suggestions, partly because the proposed technologies are a bit far from, let's say, "standard XML parsing" in Java, and it feels weird going so far when there's a similar tool already present in Java, and partly because I did in fact find a solution that only uses Java APIs to accomplish this.
I will not detail too much the solution I found, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is that if you have an XML document AND its corresponding XSD schema, you can parse and unmarshal it with JAXB, and you can do it in chunks: the chunks can be read with an event parser such as StAX and then passed to the JAXB Unmarshaller.
This practically means that you must first decide where there is a good place in your XML file where you can say "this part here has A LOT of repetitive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated many times inside a parent tag. So all you have to do is make an event listener in your StAX parser that is triggered at the start of each of those child tags, then stream the content of that child tag over to JAXB, unmarshal it with JAXB and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
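The chunk-at-a-time idea can be sketched roughly as follows. This is not the code from the article or from my project: to keep the example JDK-only, the JAXB unmarshalling step is swapped for a plain Map built by hand, and it assumes each repeated element contains only flat, text-only children. The names StaxChunkSketch, forEachChunk and readChunk are invented for the illustration.

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams over a document and hands out one chunk at a time.
class StaxChunkSketch {

    /** Calls handler once per <chunkTag> element; only one chunk is held in memory. */
    static void forEachChunk(Reader xml, String chunkTag,
                             Consumer<Map<String, String>> handler) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(chunkTag)) {
                    handler.accept(readChunk(reader, chunkTag));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
    }

    /** Reads child name -> text pairs until the closing tag of the chunk. */
    private static Map<String, String> readChunk(XMLStreamReader reader, String chunkTag)
            throws XMLStreamException {
        Map<String, String> fields = new LinkedHashMap<>();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                // getElementText() only works for text-only elements (assumption above).
                fields.put(name, reader.getElementText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals(chunkTag)) {
                break;
            }
        }
        return fields;
    }
}
```

With JAXB on the classpath, the body of readChunk would instead hand the XMLStreamReader, positioned on the start tag, to an Unmarshaller, which is the part the article covers.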

Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml

Android XML parser for simple xml node strings

I need to parse a series of simple XML nodes (String format) as they arrive from a persistent socket connection. Is a custom Android SAX parser really the best way? It seems slightly overkill to do it this way.
I had naively hoped I could cast the strings to XML then reference the names / attributes with dot syntax or similar.

I'd use the DOM Parser. It isn't as efficient as SAX, but if it's a simple XML file that's not too large, it's the easiest way to get up and moving.
Great tutorial on how to use it here: http://tutorials.jenkov.com/java-xml/dom.html

You might want to take a look at the XPath API. This is a simpler way of parsing XML. It's similar to building SQL queries and regexes.
http://www.ibm.com/developerworks/library/x-javaxpathapi.html
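For a single small node string, an XPath lookup might look like the sketch below. It uses the standard javax.xml.xpath API; the class name XPathSketch and the sample message format used in the usage note are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper around the standard XPath API.
class XPathSketch {

    /** Evaluates one XPath expression against an XML string and returns the result as text. */
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XML or XPath: " + expression, e);
        }
    }
}
```

For a node such as <msg type="chat" from="alice"><body>hi</body></msg>, the expression /msg/@from selects the attribute and /msg/body selects the element text. Each call re-parses the string, which is fine for small nodes but wasteful if you need many values from the same node.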

I'd go for a SAX Parser:
It's much more efficient in terms of memory, especially for larger files: you don't parse an entire document into objects, instead the parser performs a single uni-directional pass over the document and triggers events as it goes through.
It's actually surprisingly easy to implement: for instance take a look at Working with XML on Android by IBM. It's only listings 5 and 6 that are the actual implementation of their SAX parser so it's not a lot of code.
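To give a feel for the amount of code involved, here is a minimal SAX sketch (not the IBM listing) that pulls the attributes off the root element of one node string. It uses javax.xml.parsers and org.xml.sax, which Android also ships; the class name SaxNodeSketch and the method rootAttributes are made up for this example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class SaxNodeSketch {

    /** Returns attribute name -> value for the root element of one XML node string. */
    static Map<String, String> rootAttributes(String xmlNode) {
        Map<String, String> result = new LinkedHashMap<>();

        DefaultHandler handler = new DefaultHandler() {
            private boolean rootSeen;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (rootSeen) {
                    return; // only the first element is of interest here
                }
                rootSeen = true;
                for (int i = 0; i < attrs.getLength(); i++) {
                    result.put(attrs.getQName(i), attrs.getValue(i));
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xmlNode)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse node", e);
        }
        return result;
    }
}
```

The handler is the whole "parser": you override only the callbacks you care about and ignore the rest.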

You can try to use Konsume-XML: SAX/STAX/Pull APIs are too low-level and hard to use; DOM requires the XML to fit into memory and is still clunky to use. Konsume-XML is based on Pull and therefore it's extremely efficient, yet the API is higher-level and much easier to use.

How to deserialize Java objects from XML?

I'm sure this might have been discussed at length or answered before, however I need a bit more information on the best approach for my situation...
Problem:
We have some large XML data (anywhere from 100k to 5MB) which we need to inflate into Java objects. The issue is that the data really doesn't map onto an object very well at all, so we need to pull only certain parts of the data out and create the objects. Given that, solutions such as JAXB or XStream really aren't appropriate.
So, we need to pull XML data out and get it into java objects as efficiently as possible.
Possible Solutions:
The way I see it, we have 3 possible solutions:
SAX parsing
DOM parsing
XSLT
We can load the XML into any JAXP implementation and pull the data out using one of the above methods.
Question(s)
I have a few questions/concerns:
How does XSLT work under the hood? Is it just a DOM parser? I ask because XSLT seems like a good way to go, but I don't really want to consider it if it won't give us better performance than DOM.
What are some popular libraries that provide DOM, XSLT, and SAX XML parsers?
In your experience, what are the reasons for picking DOM, SAX, or XSLT? Does the ease of use of DOM or XSLT totally dominate the performance improvements SAX offers?
Any benchmarks out there? The ones I've found are old (as in, 8 years old). So some recent benchmarks would be appreciated.
Are there any other solutions besides those outlined above that I could be missing?
Edit:
A few clarifications... You can use XSLT to directly inject values into a Java object... it is normally used to transform XML into some other XML, however I'm talking from the standpoint of calling a method from XSLT into java to inject the value.
I'm still not clear on how an XSLT processor works exactly... How is it feeding the XML into the XSLT code you write?

Use XSLT to transform the large XML files into a local domain model that is mapped to Java objects with JAXB.
Start with the JDK 5+ built in XML libraries (unless you absolutely need XSLT 2.0, in which case use Saxon)
Don't focus on relative performance of SAX/DOM, focus on learning how to write XPath expressions and use XSLT, and then worry about performance later if and only if you find it to be a problem.
The Eclipse XML editors are decent, but if you can afford it, spring for Oxygen XML, which will let you do XPath evaluation in realtime.

We had a similar situation and I just threw together some XPath code that parsed the stuff I needed.
It was amazingly quick even on 100k+ XML files. We went as low tech as possible. We handle around 1000 files a day of that size and parsing time is very low. We have no memory issues, leaks etc.
We wrote a quick prototype in Groovy (if my memory is accurate) - proof of concept took me about 10 minutes

JAXB, the Java Architecture for XML Binding, might be what you want. You use it to inflate an XML document into a Java object graph made up of "Java content objects". These content objects are instances of classes generated by JAXB to match the XML document's schema.
But if you already have a set of Java classes, or don't yet have a schema for the document, JAXB probably isn't the best way to go. I'd suggest doing a SAX parse and then building up your Java objects during the parse. Alternatively you could try a DOM parse and then walk the resulting Document tree to pull out the parts of interest (maybe with XPath) -- but 5MB of XML might turn into 50MB of DOM tree objects in Java.

DOM, SAX and XSLT are different animals.
DOM parsing loads the entire document into memory, which for 100K to 5MB (very small by today's standards) would work.
SAX is a stream parser which reads the XML and delivers events to your code for each tag.
XSLT is a system for transforming one XML tree into another. Even if you wrote a transform that converts the input to a more suitable format, you'd still have to write something using DOM or SAX to convert it into Java objects.
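To make the XSLT point concrete, here is a small sketch using the JDK's javax.xml.transform API. The stylesheet, the element names (order, customer, person) and the class name XsltSketch are all invented for the example; it only shows that the processor takes XML in and produces simpler XML out, which you would still have to turn into objects with something else.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper around the standard transformation API.
class XsltSketch {

    /** Made-up XSLT 1.0 stylesheet: flattens /order/customer/name into <person><name>. */
    static final String FLATTEN =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:template match=\"/\">"
          + "<person><name><xsl:value-of select=\"/order/customer/name\"/></name></person>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML and returns the transformed document as a string. */
    static String transform(String xml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }
}
```

Calling XsltSketch.transform(orderXml, XsltSketch.FLATTEN) returns the flattened document; everything in the input that the stylesheet does not select is dropped.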

You can use the @XmlPath extension in EclipseLink JAXB (MOXy) to easily handle this use case. For a detailed example see:
http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html
Sample Code:
package blog.geocode;

import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;
import org.eclipse.persistence.oxm.annotations.XmlPath;

@XmlRootElement(name="kml")
@XmlType(propOrder={"country", "state", "city", "street", "postalCode"})
public class Address {

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:Thoroughfare/ns:ThoroughfareName/text()")
    private String street;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:LocalityName/text()")
    private String city;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:AdministrativeAreaName/text()")
    private String state;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:CountryNameCode/text()")
    private String country;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:PostalCode/ns:PostalCodeNumber/text()")
    private String postalCode;

}

Castor and sockets

I'm new to Castor and data binding in general. I'm working on an application that, in part, needs to take data off of a socket and unmarshall the data to make POJOs. Now, I've got the socket stuff down, and I've even generated and compiled java files thanks to Ant and Castor.
Here's the problem: the data stream that I'll receive could be one of about 9 different objects. That is, I receive a stream of text (XML) that represents an object with stuff that I'll operate on; again, depending on the object type. If it were just one object, it'd be easy: call the unmarshall commands on it and go on my merry way. But, since it could be one of many kinds of objects, how do I know what to unmarshall? I read up on mapping, but either I didn't get it, or it seems like a static mapping, not a dynamic mapping.
Any help out there?

You are right, Castor expects a static mapping. But you can work with that. You can write some code that will modify the incoming xml so that, on your side, Castor can use one schema, and on your clients' side they don't have to change their schemas.
Change the schema that Castor expects to get to something with a common root-element, with under that your nine different alternatives for your different objects (I think you can restrict it so the schema will allow only one of the nine, if that doesn't work out you could just make all the sub-elements optional).
Then you can write code that modifies the incoming xml to wrap your incoming xml with that common root-element, then feeds the wrapped xml into a stream that gets read by the Castor unmarshaller.
There are at least 3 different ways to implement the xml-wrapping part: SAX, XSLT, and XML libraries (like JDOM, DOM4J, and XOM--I prefer XOM but any of them will work).
The SAX way is probably best if you're already familiar with SAX or if one of the other ways has worked but come up short on performance. If I had to implement that then I would create an XMLFilter that takes in xml and writes xml out, stacking that on top of another piece that writes xml to an OutputStream, and writing a wrapper method around the unmarshalling stuff to feed the incoming stream to the xmlreader, copy the OutputStream to another InputStream (an easy way is to use commons-io), and feed the new InputStream to the Castor unmarshaller.
With XSLT there is no fooling with SAX. XSLT has a reputation for being painful at times, but this looks to me like a relatively straightforward transformation, though I haven't taken a stab at it myself. It has been a long time since I used XSLT for anything. I am not sure about performance either, though I wouldn't write it off out of hand.
Using XOM or JDOM or DOM4J to wrap the XML is also possible, and the learning curve is a lot lower than for SAX or XSLT. The downside is the whole XML document tends to get sucked into memory at once so if you deal with big enough documents you could run out of memory.

I have a similar thing in Jibx where all of the incoming message objects implement a base interface which has a field denoting the message type.
The text/xml is serialized into the base interface and I then used the command pattern to call the respective business logic depending upon the message type which is defined in the base interface.
Not sure if this is possible using castor but take a look at Jibx as the performance is fantastic.
http://jibx.sourceforge.net/

I appreciate your insights. You both have given me some good information to go on and new knowledge that I didn't have. In the end, I got the process to work via a hack. I grab the text stream, parse out the root tag of the message, and then switch on it to determine the right object to create. I'm unmarshalling all of my objects independently and everyone is happy on our end.
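The root-tag hack can be sketched roughly like this. It is not my actual code: the message names (login, logout) and the class RootTagDispatchSketch are placeholders, and the switch returns the name of the target class where the real code would call the matching Castor unmarshaller. StAX is used here only to find the first start tag.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; message types are placeholders.
class RootTagDispatchSketch {

    /** Returns the local name of the first element in the message. */
    static String rootTag(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Skips the XML declaration, comments and whitespace before the root.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed message", e);
        }
        throw new IllegalArgumentException("No root element found");
    }

    /** Picks the target type from the root tag; real code would unmarshal here. */
    static String targetType(String xml) {
        switch (rootTag(xml)) {
            case "login":
                return "LoginMessage";   // placeholder class name
            case "logout":
                return "LogoutMessage";  // placeholder class name
            default:
                throw new IllegalArgumentException("Unknown message: " + rootTag(xml));
        }
    }
}
```

Since only the first tag is read before dispatching, the cost of the extra pass stays small even for larger messages.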

Small modification to an XML document using StAX

I'm currently trying to read in an XML file, make some minor changes (alter the value of some attributes), and write it back out again.
I had intended to use a StAX parser (javax.xml.stream.XMLStreamReader) to read in each event, see if it was one I wanted to change, and then pass it straight on to the StAX writer (javax.xml.stream.XMLStreamWriter) if no changes were required.
Unfortunately, that doesn't look to be so simple - The writer has no way to take an event type and a parser object, only methods like writeAttribute and writeStartElement. Obviously I could write a big switch statement with a case for every possible type of element which can occur in an XML document, and just write it back out again, but it seems like a lot of trouble for something which seems like it should be simple.
Is there something I'm missing that makes it easy to write out a very similar XML document to the one you read in with StAX?

After a bit of mucking around, the answer seems to be to use the Event reader/writer versions rather than the Stream versions.
(i.e. javax.xml.stream.XMLEventReader and javax.xml.stream.XMLEventWriter)
See also http://www.devx.com/tips/Tip/37795, which is what finally got me moving.
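A minimal sketch of the event-based copy, assuming the change is "replace the value of an existing attribute on a given element". The class and method names (StaxCopySketch, setAttribute, withAttribute) are invented for this example; everything else is the standard javax.xml.stream API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, for illustration only.
class StaxCopySketch {

    /** Copies the document, replacing one attribute's value on every matching element. */
    static String setAttribute(String xml, String element, String attr, String newValue) {
        XMLEventFactory events = XMLEventFactory.newInstance();
        StringWriter out = new StringWriter();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            XMLEventWriter writer = XMLOutputFactory.newInstance()
                    .createXMLEventWriter(out);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(element)) {
                    event = withAttribute(events, event.asStartElement(), attr, newValue);
                }
                writer.add(event); // every other event passes through untouched
            }
            writer.flush();
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not rewrite document", e);
        }
        return out.toString();
    }

    /** Builds a copy of the start tag with the named attribute's value replaced. */
    private static StartElement withAttribute(XMLEventFactory events, StartElement start,
                                              String attr, String newValue) {
        List<Attribute> attributes = new ArrayList<>();
        Iterator<?> it = start.getAttributes();
        while (it.hasNext()) {
            Attribute a = (Attribute) it.next();
            attributes.add(a.getName().getLocalPart().equals(attr)
                    ? events.createAttribute(a.getName(), newValue)
                    : a);
        }
        return events.createStartElement(start.getName(), attributes.iterator(),
                                         start.getNamespaces());
    }
}
```

The point is the writer.add(event) call: XMLEventWriter accepts whole events, so there is no need for a switch over every event type. The output is not guaranteed to be byte-for-byte identical to the input (the XML declaration and quoting may differ).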

StAX works pretty well and is very fast. I used it in a project to parse XML files which are up to 20MB. I don't have a thorough analysis, but it was definitely faster than SAX.
As for your question: the difference between the streaming and event APIs, AFAIK, is control. With the streaming API you walk through the document step by step and pull out the contents you want, whereas with the event-based API you handle only the events you are interested in.

I know this is a rather old question, but if anyone else is looking for something like this, there is another alternative: the Woodstox Stax2 extension API has the method:
XMLStreamWriter2.copyEventFromReader(XMLStreamReader2 r, boolean preserveEventData)
which copies the currently pointed-to event from stream reader using stream writer. This is not only simple but very efficient. I have used it for similar modifications with success.
(how to get XMLStreamWriter2 etc? All Woodstox-provided instances implement these extended versions -- plus there are wrappers in case someone wants to use "basic" Stax variants, as well)
