I want to create a basic RSS feeder app for Android and I don't know which library to choose:
SAX or DOM. Which should I choose?
Does anyone have any experience with either on an Android platform?
Any tips?
A SAX based parser lets you store only the information you require through an event-handler style interface, while DOM based methods parse the whole file into an object model. Personally I would use SAX for both its speed and memory advantages, especially in a mobile environment: if you don't know the length of the XML at runtime, you could end up with a huge model. SAX allows you to construct your own objects/information as required, in the format you want, without having the default object model stored on top.
In general, however, a SAX based parser is useful if the XML contains machine readable data, and a DOM based parser is useful when you have structured document style data. See here for more information.
Use a SAX parser; it is faster and uses less memory.
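To make the event-handler style concrete, here is a minimal sketch using the standard javax.xml.parsers SAX API. The class name RssTitleSax and the sample feed are made up for illustration; a real feed would come from a network stream rather than a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: keeps only the <title> text of each <item>.
class RssTitleSax {
    static List<String> itemTitles(String xml) throws Exception {
        List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inItem;
            private StringBuilder text; // non-null only while inside item/title

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("item")) {
                    inItem = true;
                } else if (inItem && qName.equals("title")) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called more than once per text node, so append.
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("title") && text != null) {
                    titles.add(text.toString().trim());
                    text = null;
                } else if (qName.equals("item")) {
                    inItem = false;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item></channel></rss>";
        System.out.println(itemTitles(feed)); // prints [First, Second]
    }
}
```

Only the item titles are ever held in memory; the channel title and everything else is seen once and discarded.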
We have a new requirement:
Some BIG XML files keep coming into our system, and we need to process them immediately and quickly using Java. Each file is huge, but the information required for our processing is inside one element, which is very small.
...
...
What is the best way to extract this small portion of the data from the huge file before we start processing? If we try to load the entire file, we immediately get an out-of-memory error due to its size. Is there an efficient way in Java to get the ..data..data..data.. data element without loading the whole file or reading it line by line? Is there a SAX parser that I can use to get this done?
Thank you
SAX parsers are event based and are much faster because they do what you need: they don't load the entire XML document into memory. A SAXParser is included in the Java distribution.
I had to parse huge files in a previous project (1-2 GB) and didn't want to deal with using SAX. I find SAX too low-level in some instances and prefer to keep a traversal approach in most cases.
I have used the VTD library http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.
Well, if you want to read a part of a file, you will need to read each line of the file to be able to identify the part of the file of interest and then extract what you need.
If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.
Java comes with a built-in SAXParser implementation as well as an XPath implementation. Find the javadocs for SAXParser here and for XPath here.
StAX is another option based on streaming the data, like SAX, but it benefits from a friendlier approach (IMO) to processing the data by "pulling" what you want rather than having it "pushed" to you.
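As a rough sketch of the "pull" style with the JDK's javax.xml.stream API: the loop below advances through the stream until it reaches the wanted element, reads its text, and stops, so the rest of the file is never read. The class name StaxExtract and the element name payload are placeholders, and the sketch assumes the wanted element contains only text.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: finds one small element inside a large stream.
class StaxExtract {
    /** Returns the text of the first element with the given name, or null if absent. */
    static String firstElementText(InputStream in, String elementName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // getElementText() only works for text-only elements.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><bulk>ignored</bulk><payload>data</payload><bulk>more</bulk></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstElementText(in, "payload")); // prints data
    }
}
```

For a real file you would pass a FileInputStream (ideally buffered) instead of the in-memory stream used here.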
Suppose there is a very big XML file and a DOM parser is used to parse it.
Now there is a requirement to add/delete elements from the XML, i.e. edit the XML.
How can the XML be edited when the entire document cannot be loaded due to memory constraints?
What strategy could solve this?
Consider using a SAX parser instead, which doesn't keep the whole document in memory. It will be faster and will also use much less memory.
As two other answers mentioned already, a SAX parser will do the trick. Your other alternative to DOM is a StAX parser.
Traditionally, XML APIs are either:
DOM based - the entire document is read into memory as a tree
structure for random access by the calling application
event based - the application registers to receive events as
entities are encountered within the source document.
Both have advantages; the former (for example, DOM) allows for random
access to the document, the latter (e.g. SAX) requires a small memory
footprint and is typically much faster.
These two access metaphors can be thought of as polar opposites. A
tree based API allows unlimited, random access and manipulation, while
an event based API is a 'one shot' pass through the source document.
StAX was designed as a median between these two opposites. In the StAX
metaphor, the programmatic entry point is a cursor that represents a
point within the document. The application moves the cursor forward -
'pulling' the information from the parser as it needs. This is
different from an event based API - such as SAX - which 'pushes' data
to the application - requiring the application to maintain state
between events as necessary to keep track of location within the
document.
StAX is my preferred approach for handling large documents. If DOM is a requirement, check out DOM implementations like Xerces that support lazy construction of DOM nodes:
http://xerces.apache.org/xerces-j/faq-write.html#faq-4
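One way to edit with StAX without holding the document in memory is to copy events from a reader to a writer, dropping or adding events on the way. The following is only a sketch under assumptions: the class name StaxEdit and the element names are invented, it matches elements by local name only, and it works on strings for brevity where real code would use file streams.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical streaming editor: deletes every <dropName> subtree and
// appends <addName>addText</addName> just before each </parentName>.
class StaxEdit {
    static String edit(String xml, String dropName, String parentName,
                       String addName, String addText) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        int skipDepth = 0; // > 0 while inside a subtree being deleted
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (skipDepth > 0) {
                if (e.isStartElement()) skipDepth++;
                else if (e.isEndElement()) skipDepth--;
                continue; // swallow everything inside the deleted subtree
            }
            if (e.isStartElement()
                    && e.asStartElement().getName().getLocalPart().equals(dropName)) {
                skipDepth = 1;
                continue;
            }
            if (e.isEndElement()
                    && e.asEndElement().getName().getLocalPart().equals(parentName)) {
                writer.add(events.createStartElement("", "", addName));
                writer.add(events.createCharacters(addText));
                writer.add(events.createEndElement("", "", addName));
            }
            writer.add(e); // copy everything else unchanged
        }
        writer.flush();
        writer.close();
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // <b> is removed and <c>3</c> is appended inside <list>.
        System.out.println(edit("<list><a>1</a><b>2</b></list>", "b", "list", "c", "3"));
    }
}
```

Memory use stays roughly constant regardless of document size, because at any moment only the current event is held.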
Your assumption that memory constraints prevent loading the XML document may apply only to DOM. VTD-XML loads the entire XML into memory and does so efficiently (about 1.3x the size of the XML document), in terms of both memory and performance...
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
Another distinct benefit, which no other XML framework in existence has, is its incremental update capability...
http://www.devx.com/xml/Article/36379
As stivlo mentioned, you can use a SAX parser for reading the XML.
But for writing the XML, you can write to a FileOutputStream as plain text. I am sure your requirements will specify after which tag or under which tag the new data should be inserted.
I need to parse a series of simple XML nodes (String format) as they arrive from a persistent socket connection. Is a custom Android SAX parser really the best way? It seems slightly overkill to do it this way.
I had naively hoped I could cast the strings to XML then reference the names / attributes with dot syntax or similar.
I'd use the DOM Parser. It isn't as efficient as SAX, but if it's a simple XML file that's not too large, it's the easiest way to get up and moving.
Great tutorial on how to use it here: http://tutorials.jenkov.com/java-xml/dom.html
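For a small node that arrives as a String, the DOM route can be as short as the sketch below. It uses the standard javax.xml.parsers and org.w3c.dom APIs; the class name DomSnippet and the sample message are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: parses one small XML string into a DOM element.
class DomSnippet {
    static Element parseRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element msg = parseRoot("<message id=\"7\" type=\"chat\">hello</message>");
        System.out.println(msg.getTagName());         // prints message
        System.out.println(msg.getAttribute("id"));   // prints 7
        System.out.println(msg.getTextContent());     // prints hello
    }
}
```

It is not dot syntax, but getAttribute and getTextContent come fairly close for simple nodes.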
You might want to take a look at the XPath library. This is a simpler way of parsing XML. It's similar to building SQL queries and regexes.
http://www.ibm.com/developerworks/library/x-javaxpathapi.html
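A minimal sketch of the XPath approach with the JDK's javax.xml.xpath package follows. The class name XPathSnippet and the sample message are assumptions; note that evaluating against an InputSource parses the whole input first, so this suits small nodes rather than huge files.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one XPath expression against an XML string.
class XPathSnippet {
    static String eval(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Returns the string value of the first matching node ("" if none).
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<message id=\"7\"><body>hello</body></message>";
        System.out.println(eval(xml, "/message/@id"));  // prints 7
        System.out.println(eval(xml, "/message/body")); // prints hello
    }
}
```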
I'd go for a SAX Parser:
It's much more efficient in terms of memory, especially for larger files: you don't parse an entire document into objects, instead the parser performs a single uni-directional pass over the document and triggers events as it goes through.
It's actually surprisingly easy to implement: for instance take a look at Working with XML on Android by IBM. It's only listings 5 and 6 that are the actual implementation of their SAX parser so it's not a lot of code.
You can try to use Konsume-XML: SAX/STAX/Pull APIs are too low-level and hard to use; DOM requires the XML to fit into memory and is still clunky to use. Konsume-XML is based on Pull and therefore it's extremely efficient, yet the API is higher-level and much easier to use.
I'm sure this might have been discussed at length or answered before, however I need a bit more information on the best approach for my situation...
Problem:
We have some large XML data (anywhere from 100k to 5MB) which we need to inflate into Java objects. The issue is that the data really doesn't map onto an object very well at all, so we need to pull only certain parts of the data out and create the objects. Given that, solutions such as JAXB or XStream really aren't appropriate.
So, we need to pull XML data out and get it into java objects as efficiently as possible.
Possible Solutions:
The way I see it, we have 3 possible solutions:
SAX parsing
DOM parsing
XSLT
We can load the XML into any JAXP implementation and pull the data out using one of the above methods.
Question(s)
I have a few questions/concerns:
How does XSLT work under the hood? Is it just a DOM parser? I ask because XSLT seems like a good way to go, but I don't really want to consider it if it won't give us better performance than DOM.
What are some popular libraries that provide DOM, XSLT, and SAX XML parsers?
In your experience, what are the reasons for picking DOM, SAX, or XSLT? Does the ease of use of DOM or XSLT totally dominate the performance improvements SAX offers?
Any benchmarks out there? The ones I've found are old (as in, 8 years old). So some recent benchmarks would be appreciated.
Are there any other solutions besides those outlined above that I could be missing?
Edit:
A few clarifications... You can use XSLT to directly inject values into a Java object... it is normally used to transform XML into some other XML, however I'm talking from the standpoint of calling a method from XSLT into java to inject the value.
I'm still not clear on how an XSLT processor works exactly... How is it feeding the XML into the XSLT code you write?
Use XSLT to transform the large XML files into a local domain model that is mapped to Java objects with JAXB.
Start with the JDK 5+ built in XML libraries (unless you absolutely need XSLT 2.0, in which case use Saxon)
Don't focus on relative performance of SAX/DOM, focus on learning how to write XPath expressions and use XSLT, and then worry about performance later if and only if you find it to be a problem.
The Eclipse XML editors are decent, but if you can afford it, spring for Oxygen XML, which will let you do XPath evaluation in realtime.
We had a similar situation and I just threw together some XPath code that parsed the stuff I needed.
It was amazingly quick even on 100k+ XML files. We went as low tech as possible. We handle around 1000 files a day of that size and parsing time is very low. We have no memory issues, leaks etc.
We wrote a quick prototype in Groovy (if my memory is accurate) - proof of concept took me about 10 minutes
JAXB, the Java Architecture for XML Binding, might be what you want. You use it to inflate an XML document into a Java object graph made up of "Java content objects". These content objects are instances of classes generated by JAXB to match the XML document's schema.
But if you already have a set of Java classes, or don't yet have a schema for the document, JAXB probably isn't the best way to go. I'd suggest doing a SAX parse and then building up your Java objects during the parse. Alternatively you could try a DOM parse and then walk the resulting Document tree to pull out the parts of interest (maybe with XPath) -- but 5MB of XML might turn into 50MB of DOM tree objects in Java.
DOM, SAX and XSLT are different animals.
DOM parsing loads the entire document into memory, which for 100K to 5MB (very small by today's standards) would work.
SAX is a stream parser which reads the XML and delivers events to your code for each tag.
XSLT is a system for transforming one XML tree into another. Even if you wrote a transform that converts the input to a more suitable format, you'd still have to write something using DOM or SAX to convert it into Java objects.
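To show how the XML gets fed to a stylesheet, here is a small sketch using the JDK's built-in javax.xml.transform API (XSLT 1.0). The class name XsltSketch, the stylesheet, and the sample data are all made up. The result here is plain text, which illustrates the point above: the output still has to be parsed or mapped by something else before it becomes Java objects.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies a stylesheet to an XML string.
class XsltSketch {
    static String transform(String xslt, String xml) throws Exception {
        // The stylesheet is compiled once; the Transformer then consumes the input Source.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xslt = "<xsl:stylesheet version=\"1.0\" "
                + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"text\"/>"
                + "<xsl:template match=\"/\">"
                + "<xsl:for-each select=\"//person\">"
                + "<xsl:value-of select=\"name\"/>;"
                + "</xsl:for-each></xsl:template></xsl:stylesheet>";
        String xml = "<people><person><name>Ann</name><age>3</age></person>"
                + "<person><name>Bob</name></person></people>";
        System.out.println(transform(xslt, xml)); // prints Ann;Bob;
    }
}
```

The input Source can also be a DOMSource or SAXSource; what the processor builds internally from it is implementation dependent.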
You can use the @XmlPath extension in EclipseLink JAXB (MOXy) to easily handle this use case. For a detailed example see:
http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html
Sample Code:
package blog.geocode;

import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;
import org.eclipse.persistence.oxm.annotations.XmlPath;

// Each field is mapped by an XPath into the KML geocode response.
@XmlRootElement(name="kml")
@XmlType(propOrder={"country", "state", "city", "street", "postalCode"})
public class Address {

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:Thoroughfare/ns:ThoroughfareName/text()")
    private String street;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:LocalityName/text()")
    private String city;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:AdministrativeAreaName/text()")
    private String state;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:CountryNameCode/text()")
    private String country;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:PostalCode/ns:PostalCodeNumber/text()")
    private String postalCode;
}
What is the difference between JAXP and JAXB?
JAXP (Java API for XML Processing) is a rather outdated umbrella term covering the various low-level XML APIs in JavaSE, such as DOM, SAX and StAX.
JAXB (Java Architecture for XML Binding) is a specific API (the stuff under javax.xml.bind) that uses annotations to bind XML documents to a java object model.
JAXP is the Java API for XML Processing, which provides a platform for parsing XML files with DOM or SAX parsers.
JAXB, on the other hand, is the Java Architecture for XML Binding, which makes it easier to access XML documents from applications written in the Java programming language.
For example, to access the data in a Computer.xml file with JAXP, we perform the following steps:
Create a SAX parser or DOM parser and then parse the data. If we use
DOM, it may be memory intensive if the document is too big. If we use
a SAX parser, it starts reading at the beginning of the document.
When it encounters something significant (in SAX terms, an
"event"), such as the start of an XML tag or the text inside a
tag, it makes that data available to the calling application.
Then create a content handler that defines the methods to be
notified by the parser when it encounters an event. These methods,
known as callback methods, take the appropriate action on the data
they receive.
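The order in which those callbacks fire can be seen with a tiny tracing handler. This is a sketch only: the class name SaxTrace is invented, and the one-line document stands in for Computer.xml, whose real contents are not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical content handler that records each callback as it happens.
class SaxTrace {
    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() { events.add("startDocument"); }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) events.add("text:" + text);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end:" + qName);
            }

            @Override
            public void endDocument() { events.add("endDocument"); }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for Computer.xml; the real file's structure is an assumption.
        trace("<computer><cpu>x86</cpu></computer>").forEach(System.out::println);
    }
}
```

Each printed line corresponds to one callback, in document order, which is all the state the parser itself gives you.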
To perform the same operations with JAXB, the following steps are needed to access Computer.xml:
Bind the schema for the XML document.
Unmarshal the document into Java content objects. The Java content objects represent the content and organization of the XML document, and are directly available to your program.
After unmarshalling, your program can access and display the data in the XML document simply by accessing the data in the Java content objects and then displaying it. There is no need to create and use a parser and no need to write a content handler with callback methods. What this means is that developers can access and process XML data without having to know XML or XML processing.
The key difference is the role the XML Schema plays. JAXP is the older API and parses without needing the XML Schema, while JAXB handles the schema binding as the very first step.