I've been looking at loading XML files with Java and I just can't seem to decipher a certain part of it.
I understand that SAX is a streaming mechanism, but when talking about DOM, various sites talk about the model "loading in the full file" or "loading in the all tags", supported by the recommendation to use SAX with large XML files.
To what degree does DOM actually load the full file? The second I access the root node, does it allocate program memory for every single byte of the file? Does it only load tags until the lowest level when it loads text contents?
I'm going to be working with large files, but random access would be useful and editing is a requirement, so I believe DOM is the best choice for me.
Thanks a lot.
It does load the entire file and constructs a tree structure in memory. Thus, every single tag, attribute and any nested tags (no matter how many levels of nesting) will be loaded. The constructed tree simply grows with the size of the XML file.
Yes, DOM reads the whole document, parses it, and places it in memory.
If you're parsing using DOM, you do something similar to this:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(file);
(inside a try/catch)
The moment the parse is executed, the Document doc variable will contain the entire document represented as a DOM hierarchy.
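To make that concrete, here is a minimal sketch (the class name DomLoadsEverything and the sample XML are made up for illustration). It parses a small document and then walks the tree without touching the input again, which only works because parse() has already consumed the whole document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of any library.
class DomLoadsEverything {

    // Parses the whole XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Recursively counts the node and all of its descendants (elements and text).
    // Attributes are not child nodes, so they are not counted here.
    static int countNodes(Node node) {
        int count = 1;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<root><a>one</a><b><c>two</c></b></root>");
        // 4 elements + 2 text nodes
        System.out.println(countNodes(doc.getDocumentElement())); // prints 6
    }
}
```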
I have an XML document and I need to make it searchable via a webapp. The document is currently only 6 MB, but it could become extremely large, so from my research SAX seems the way to go.
So my question is, given a search term, do I:
1. Load the document into memory once (as a list of beans) and then search it whenever need be?
or
2. Parse the document looking for the desired search term, add only the matches to the list of beans, and repeat this process with each search?
I am not that experienced with webapps, but I am trying to figure out the optimal way to approach this. Does anyone with Tomcat, SAX and Java web app experience have any suggestions as to which would be better?
Regards,
Nate
When you say that your XML file could be very large, I assume you do not want to keep it in memory. If you want it to be searchable, I understand that you want indexed access, without a full read each time. IMHO, the only way to achieve that is to parse the file, load the data into a lightweight file database (Derby, HSQL or H2) and add relevant indexes to the database. Databases allow indexed search on data that is not held in memory; XML files do not.
Assuming your search field is a field that is known to you, for example let the structure of the xml be:
<a>....</a>
<x>
<y>search text1</y>
<z>search text2</z>
</x>
<b>...</b>
and say the search has to be made on 'x' and its children, you can achieve this using a StAX parser and JAXB.
To understand the difference between StAX and SAX, please refer to:
When should I choose SAX over StAX?
Using these APIs you avoid storing the entire document in memory. With the StAX parser you parse the document and, when you encounter the 'x' tag, load it into memory (Java beans) using JAXB.
Note: Only x and its children will be loaded to memory, not the entire document parsed till now.
Do not use any approaches that use DOM parsers.
Sample code to load only the part of the document where the search field is present.
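import javax.xml.bind.JAXBContext;   // jakarta.xml.bind.* on newer JAXB versions
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stream.StreamSource;

XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("file");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);
// Advance event by event until the cursor sits on the <x> start tag.
// (nextTag() would throw on the text content inside <a>.)
while (xsr.hasNext()) {
    if (xsr.next() == XMLStreamConstants.START_ELEMENT && xsr.getLocalName().equals("x")) {
        break;
    }
}
// X is the JAXB-annotated bean class for the <x> element.
JAXBContext jc = JAXBContext.newInstance(X.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
JAXBElement<X> jb = unmarshaller.unmarshal(xsr, X.class);
xsr.close();
X x = jb.getValue();
System.out.println(x.y.content);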
Now you have the field content to return the appropriate field. When the user again searches for the same field under 'x', give the results from the memory and avoid parsing the XML again.
Searching the file using XPath or XQuery is likely to be very fast (quite fast enough unless you are talking thousands of transactions per second). What takes time is parsing the file - building a tree in memory so that XPath or XQuery can search it. So (as others have said) a lot depends on how frequently the contents of the file change. If changes are infrequent, you should be able to keep a copy of the file in shared memory, so the parsing cost is amortized over many searches. But if changes are frequent, things get more complicated. You could try keeping a copy of the raw XML on disk, and a copy of the parsed XML in memory, and keeping the two in sync. Or you could bite the bullet and move to using an XML database - the initial effort will pay off in the end.
Your comment that "SAX is the way to go" would only be true if you want to parse the file each time you search it. If you're doing that, then you want the fastest possible way to parse the file. But a much better way forward is to avoid parsing it afresh on each search.
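As a rough sketch of "parse once, search many times" using only the JDK's DOM and XPath support: the class name CachedXmlSearch and the element name item are assumptions made up for this example, not something from the question. The search term is passed as an XPath variable rather than concatenated into the expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: parses the XML once and keeps the tree for later searches.
class CachedXmlSearch {
    private final Document doc;

    CachedXmlSearch(String xml) {
        try {
            // The expensive step: happens once, at construction time.
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Returns the text of every <item> (assumed element name) containing the term.
    // synchronized because a DOM tree is not guaranteed to be thread-safe.
    synchronized List<String> search(String term) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind $term to the caller's value instead of building the expression by hand.
            xpath.setXPathVariableResolver(name -> term);
            NodeList hits = (NodeList) xpath.evaluate(
                    "//item[contains(., $term)]", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

In a webapp you would create one such object at startup (or when the file changes) and share it between requests, so only the XPath evaluation runs per search.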
I need to parse relatively big XML files on Android.
Some node internal structure contains HTML tags, for some other nodes I need to pull content from different depth levels. Therefore, instead of using XmlPullParser I plan to:
using XPath, find the proper node
using 'getElementsByTagName' find appropriate sub-node(s)
extract information and save it in my custom data objects.
The problem I have is performance. This is how I open the file:
File file = new File(_path);
FileInputStream is = new FileInputStream(file);
XPath xPath = XPathFactory.newInstance().newXPath();
NamespaceContext context = new NamespaceContextMap("def", __URL__);
xPath.setNamespaceContext(context);
Object objs = xPath.evaluate("/def:ROOT_ELEMENT/*",
        new InputSource(is), XPathConstants.NODESET);
Even though I only need a few strings from the very beginning of the XML file, it looks like XPath parses the WHOLE XML file and puts it into a DOM structure.
In some cases I need access to the full object, and it is OK for the operation to run for a few seconds on a file of a few megabytes.
In other cases I only need to get a few nodes and don't want users to wait while my program performs redundant parsing.
Q1: What is the way to get some parts of XML file without parsing it in full?
Q2: Is there any way to restrict XPath from scanning/parsing WHOLE XML file? For instance: scan till 2nd level of depth?
Thank you.
P.S. In one particular case, the XML file is in the FB2 file format. If you have any specific tips that could resolve my problem for parsing fb2 files, please feel free to add additional comments.
I don't know too much about the XML toolset available for android, except to know that it's painfully limited!
Probably the best way to tackle this requirement is to write a streaming SAX filter that looks for the parts of the document you are interested in, and builds a DOM containing only those parts, which you can then query using XPath. I'm a bit reluctant to advise that, because it won't be easy if you haven't done such things before, but it seems the right approach.
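A full SAX filter that builds a partial DOM is more than fits here. As a simpler, related sketch (a different technique, named plainly): if you only need a value that appears early in the file, a plain SAX handler can abort the parse as soon as it has seen it, by throwing an exception of its own. The class FirstElementText and the exception StopParsing are invented for this example. The factory is left non-namespace-aware, so element names are compared by qName.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns the text of the first element with the given name
// and stops reading the input right after its end tag.
class FirstElementText {

    // Thrown on purpose to abort the parse once the value is complete.
    private static final class StopParsing extends SAXException {
        StopParsing() { super("value found, stopping early"); }
    }

    static String firstText(InputSource source, String elementName) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inside; // true while between the start and end tag

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals(elementName)) inside = true;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inside) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (inside && qName.equals(elementName)) throw new StopParsing();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (StopParsing expected) {
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return null; // the element never appeared
    }

    public static void main(String[] args) {
        String xml = "<book><title>Hello</title><body>a very long body</body></book>";
        System.out.println(firstText(new InputSource(new StringReader(xml)), "title"));
    }
}
```

For a real file you would pass new InputSource(new FileInputStream(file)) instead of the StringReader.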
I read some articles about the XML parsers and came across SAX and DOM.
SAX is event-based and DOM is tree model -- I don't understand the differences between these concepts.
From what I have understood, event-based means some kind of event happens to the node. Like when one clicks a particular node it will give all the sub nodes rather than loading all the nodes at the same time. But in the case of DOM parsing it will load all the nodes and make the tree model.
Is my understanding correct?
Please correct me If I am wrong or explain to me event-based and tree model in a simpler manner.
Well, you are close.
In SAX, events are triggered when the XML is being parsed. When the parser is parsing the XML, and encounters a tag starting (e.g. <something>), then it triggers the tagStarted event (actual name of event might differ). Similarly when the end of the tag is met while parsing (</something>), it triggers tagEnded. Using a SAX parser implies you need to handle these events and make sense of the data returned with each event.
In DOM, there are no events triggered while parsing. The entire XML is parsed and a DOM tree (of the nodes in the XML) is generated and returned. Once parsed, the user can navigate the tree to access the various data previously embedded in the various nodes in the XML.
In general, DOM is easier to use but has an overhead of parsing the entire XML before you can start using it.
In just a few words...
SAX (Simple API for XML): Is a stream-based processor. You only have a tiny part in memory at any time and you "sniff" the XML stream by implementing callback code for events like tagStarted() etc. It uses almost no memory, but you can't do "DOM" stuff, like use xpath or traverse trees.
DOM (Document Object Model): You load the whole thing into memory - it's a massive memory hog. You can blow memory with even medium sized documents. But you can use xpath and traverse the tree etc.
Here in simpler words:
DOM
Tree model parser (Object based) (Tree of nodes).
DOM loads the file into memory and then parses the file.
Has memory constraints since it loads the whole XML file before parsing.
DOM is read and write (can insert or delete nodes).
If the XML content is small, then prefer DOM parser.
Backward and forward search is possible for searching the tags and evaluation of the
information inside the tags. So this gives the ease of navigation.
Slower at run time.
SAX
Event based parser (Sequence of events).
SAX parses the file as it reads it, i.e. parses node by node.
No memory constraints as it does not store the XML content in the memory.
SAX is read only i.e. can’t insert or delete the node.
Use the SAX parser when the XML content is large.
SAX reads the XML file from top to bottom and backward navigation is not possible.
Faster at run time.
You are correct in your understanding of the DOM based model. The XML file will be loaded as a whole and all its contents will be built as an in-memory representation of the tree the document represents. This can be time- and memory-consuming, depending on how large the input file is. The benefit of this approach is that you can easily query any part of the document, and freely manipulate all the nodes in the tree.
The DOM approach is typically used for small XML structures (where small depends on how much horsepower and memory your platform has) that may need to be modified and queried in different ways once they have been loaded.
SAX on the other hand is designed to handle XML input of virtually any size. Instead of the XML framework doing the hard work for you in figuring out the structure of the document and preparing potentially lots of objects for all the nodes, attributes etc., SAX completely leaves that to you.
What it basically does is read the input from the top and invoke callback methods you provide when certain "events" occur. An event might be hitting an opening tag, an attribute in the tag, finding text inside an element or coming across an end-tag.
SAX stubbornly reads the input and tells you what it sees in this fashion. It is up to you to maintain all state-information you require. Usually this means you will build up some sort of state-machine.
While this approach to XML processing is a lot more tedious, it can be very powerful, too. Imagine you want to just extract the titles of news articles from a blog feed. If you read this XML using DOM it would load all the article contents, all the images etc. that are contained in the XML into memory, even though you are not even interested in it.
With SAX you can just check if the element name is (e.g.) "title" whenever your "startTag" event method is called. If so, you know that you need to add whatever the next "elementText" event offers you. When you receive the "endTag" event call, you check again if this is the closing element of the "title". After that, you just ignore all further elements, until either the input ends, or another "startTag" with a name of "title" comes along. And so on...
You could read through megabytes and megabytes of XML this way, just extracting the tiny amount of data you need.
The negative side of this approach is of course, that you need to do a lot more book-keeping yourself, depending on what data you need to extract and how complicated the XML structure is. Furthermore, you naturally cannot modify the structure of the XML tree, because you never have it in hand as a whole.
So in general, SAX is suitable for combing through potentially large amounts of data you receive with a specific "query" in mind, but need not modify, while DOM is more aimed at giving you full flexibility in changing structure and contents, at the expense of higher resource demand.
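The title-extraction idea described above could look roughly like this with the JDK's SAX API, where the actual callback names are startElement, characters and endElement. The class name TitleExtractor and the feed layout are assumptions for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers only the text of <title> elements,
// everything else in the stream is seen once and then forgotten.
class TitleExtractor extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) current = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append.
        if (current != null) current.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && current != null) {
            titles.add(current.toString());
            current = null;
        }
    }

    // Convenience wrapper that runs the handler over an XML string.
    static List<String> extract(String xml) {
        TitleExtractor handler = new TitleExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.titles;
    }
}
```

The single field current is the whole "state machine" here. More complicated extractions need more of this bookkeeping, which is the cost mentioned above.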
You're comparing apples and pears. SAX is a parser that parses serialized DOM structures. There are many different parsers, and "event-based" refers to the parsing method.
Maybe a small recap is in order:
The document object model (DOM) is an abstract data model that describes a hierarchical, tree-based document structure; a document tree consists of nodes, namely element, attribute and text nodes (and some others). Nodes have parents, siblings and children and can be traversed, etc., all the stuff you're used to from doing JavaScript (which incidentally has nothing to do with the DOM).
A DOM structure may be serialized, i.e. written to a file, using a markup language like HTML or XML. An HTML or XML file thus contains a "written out" or "flattened out" version of an abstract document tree.
For a computer to manipulate, or even display, a DOM tree from a file, it has to deserialize, or parse, the file and reconstruct the abstract tree in memory. This is where parsing comes in.
Now we come to the nature of parsers. One way to parse would be to read in the entire document and recursively build up a tree structure in memory, and finally expose the entire result to the user. (I suppose you could call these parsers "DOM parsers".) That would be very handy for the user (I think that's what PHP's XML parser does), but it suffers from scalability problems and becomes very expensive for large documents.
On the other hand, event-based parsing, as done by SAX, looks at the file linearly and simply makes call-backs to the user whenever it encounters a structural piece of data, like "this element started", "that element ended", "some text here", etc. This has the benefit that it can go on forever without concern for the input file size, but it's a lot more low-level because it requires the user to do all the actual processing work (by providing call-backs). To return to your original question, the term "event-based" refers to those parsing events that the parser raises as it traverses the XML file.
The Wikipedia article has many details on the stages of SAX parsing.
In practice: book.xml
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
</bookstore>
DOM presents the XML document as a tree structure in memory.
DOM is W3C standard.
DOM parser works on Document Object Model.
DOM occupies more memory, preferred for small XML documents
DOM is Easy to navigate either forward or backward.
SAX presents the xml document as event based like start element:abc, end element:abc.
SAX is not a W3C standard; it was developed by a group of developers.
SAX keeps almost nothing in memory, so it is preferred for large XML documents.
Backward navigation is not possible as it processes the document sequentially.
Event happens to a node/element and it gives all sub nodes(Latin nodus, ‘knot’).
This XML document, when passed through a SAX parser, will generate a sequence of events like the following:
start element: bookstore
start element: book with an attribute category equal to cooking
start element: title with an attribute lang equal to en
Text node, with data equal to Everyday Italian
....
end element: title
.....
end element: book
end element: bookstore
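A small sketch that records such an event sequence with the JDK's SAX parser. The class name EventLogger and the output format are made up. Whitespace-only text between tags is skipped to keep the log short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that writes one log line per SAX event.
class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        StringBuilder line = new StringBuilder("start element: " + qName);
        for (int i = 0; i < atts.getLength(); i++) {
            line.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
        }
        events.add(line.toString());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Large text nodes may arrive in several calls; fine for a short demo.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("text: " + text);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end element: " + qName);
    }

    static List<String> log(String xml) {
        EventLogger handler = new EventLogger();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        String xml = "<bookstore><book category=\"cooking\">"
                + "<title lang=\"en\">Everyday Italian</title></book></bookstore>";
        log(xml).forEach(System.out::println);
    }
}
```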
Both SAX and DOM are used to parse XML documents. Both have advantages and disadvantages, and which one to use depends on the situation.
SAX:
Parses node by node
Does not store the XML in memory
We can't insert or delete a node
Top to bottom traversing
DOM
Stores the entire XML document into memory before processing
Occupies more memory
We can insert or delete nodes
Traverse in any direction.
If we only need to find a node and do not need to insert or delete anything, we can go with SAX; otherwise use DOM, provided we have enough memory.
Suppose there is a very big XML file and a DOM parser is used to parse it.
Now there is a requirement to add/delete elements, i.e. to edit the XML.
How can the XML be edited if the entire document cannot be loaded due to memory constraints?
What could be the strategy to solve this?
You may consider using a SAX parser instead, which doesn't keep the whole document in memory. It will be faster and will also use much less memory.
As two other answers mentioned already, a SAX parser will do the trick. Your other alternative to DOM is a StAX parser.
Traditionally, XML APIs are either:
DOM based - the entire document is read into memory as a tree
structure for random access by the calling application
event based - the application registers to receive events as
entities are encountered within the source document.
Both have advantages; the former (for example, DOM) allows for random
access to the document, the latter (e.g. SAX) requires a small memory
footprint and is typically much faster.
These two access metaphors can be thought of as polar opposites. A
tree based API allows unlimited, random access and manipulation, while
an event based API is a 'one shot' pass through the source document.
StAX was designed as a median between these two opposites. In the StAX
metaphor, the programmatic entry point is a cursor that represents a
point within the document. The application moves the cursor forward -
'pulling' the information from the parser as it needs. This is
different from an event based API - such as SAX - which 'pushes' data
to the application - requiring the application to maintain state
between events as necessary to keep track of location within the
document.
StAX is my preferred approach for handling large documents. If DOM is a requirement, check out DOM implementations like Xerces that support lazy construction of DOM nodes:
http://xerces.apache.org/xerces-j/faq-write.html#faq-4
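For the editing requirement, one possible streaming approach (a sketch, not the only way) is to copy the document from a StAX XMLEventReader to an XMLEventWriter and inject extra events at the right spot, so only the current event is held in memory. The class StreamingInsert and its parameters are invented for this example. It appends a new child to every element named parentName.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams the input to the output and adds
// <childName>childText</childName> as the last child of each <parentName>.
class StreamingInsert {

    static String addChild(String xml, String parentName, String childName, String childText) {
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            XMLEventFactory events = XMLEventFactory.newInstance();

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                // Just before the parent's end tag is copied, emit the new child.
                if (event.isEndElement()
                        && event.asEndElement().getName().getLocalPart().equals(parentName)) {
                    writer.add(events.createStartElement("", "", childName));
                    writer.add(events.createCharacters(childText));
                    writer.add(events.createEndElement("", "", childName));
                }
                writer.add(event); // copy the original event unchanged
            }
            writer.close();
            reader.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("streaming copy failed", e);
        }
    }
}
```

With real files you would hand the factories a FileInputStream and a FileOutputStream for a temporary file, then replace the original once the copy has finished. Deleting an element works the same way, by not forwarding its events.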
Your assumption of memory constraint loading the XML document may only apply to DOM. VTD-XML loads the entire XML in memory, and does it efficiently (1.3x the size of XML document)... both in memory and performance...
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
Another distinct benefit, which no other XML framework in existence has, is its incremental update capability...
http://www.devx.com/xml/Article/36379
As stivlo mentioned you can use a SAX parser for reading the XML.
But for writing the XML you can write to a file output stream as plain text. I am sure you will get a requirement that specifies after which tag or under which tag the new data should be inserted.
I have a JAVA class in which I have implemented parsing of XML using the DOM parser. The XML file that is parsed is a configuration file which has configuration params. Any request coming to my website will be redirected based on the information that is returned from this xml file. I have 2 questions around this
1) I would like to parse the file only once, not every time. Since the DOM parser loads the XML into memory the first time, how can I check whether the file is already in memory, so that the following is not called every time?
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new File(sFpath));
2) If the XML file changes, how do I make sure the changed XML file is re-loaded?
Thanks,
The DOM is an intermediate format - parse it in some application-specific (and friendly) object structure, and stash that in a singleton. You don't want to go hunting through a DOM for every web request. Then, regularly (every x minutes or y web requests), check whether the file has been updated, re-parse it, and update your singleton.
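A minimal sketch of that idea, assuming a made-up config layout of <param name="..." value="..."/> elements and a made-up class name ConfigCache. The file is parsed into a plain map, and re-parsed only when its last-modified timestamp changes.

```java
import java.io.File;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical cache: keeps the parsed settings, not the DOM, between requests.
class ConfigCache {
    private final File file;
    private long lastLoaded = Long.MIN_VALUE;  // timestamp of the version we parsed
    private Map<String, String> params = Collections.emptyMap();
    int parseCount = 0;  // only here so the sketch can show when parsing happens

    ConfigCache(File file) {
        this.file = file;
    }

    // Returns a parameter value, re-parsing first if the file changed on disk.
    synchronized String get(String name) {
        long modified = file.lastModified();
        if (modified != lastLoaded) {
            params = load(file);
            lastLoaded = modified;
            parseCount++;
        }
        return params.get(name);
    }

    // Parses the file with DOM once and copies the values into a simple map.
    private static Map<String, String> load(File file) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
            NodeList nodes = doc.getElementsByTagName("param"); // assumed element name
            Map<String, String> result = new HashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                result.put(e.getAttribute("name"), e.getAttribute("value"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not load " + file, e);
        }
    }
}
```

This version checks the timestamp on every call, which is cheap compared to parsing. Checking only every x minutes, as suggested above, would just add a time comparison in front of the lastModified() call. In a webapp, one shared instance would be created at startup.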
The file reading should be implemented separately, i.e. not alongside the code that handles requests (perhaps in a static initialization block), and then you can use a file watcher to detect file changes. Options for file watching:
File Watcher
WatchService API(Java 7)
JFileNotify
You can keep the DOM in application memory just like any other data - the details depend on what application server / framework you are using. But DOM is a poor choice, not just because of its clumsy API, but also because DOM is not thread-safe, so all access would need to be synchronized. You're better off with a tree model that is read-only once parsed. Consider using Saxon and XPath/XQuery for this - load the tree once into a read only Saxon tree that can then be repeatedly accessed using XPath or XQuery, invoked from your Java application.
Creating Java classes to represent your configuration data more explicitly, as suggested by cdegroot, is an alternative, but not really necessary in my view. It will probably involve more work for you each time you add something to the configuration file.