I have a Java class in which I have implemented parsing of XML using the DOM parser. The XML file that is parsed is a configuration file with configuration params. Any request coming to my website is redirected based on the information returned from this XML file. I have two questions around this:
1) I would like to do the file parsing only once and not every time. Since the DOM parser loads the XML into memory after the first parse, I would like to know how to check whether the file is already available in memory, so that the following is not called every time:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new File(sFpath));
2) If the XML file changes, how do I make sure the new, changed XML file is re-loaded?
Thanks,
The DOM is an intermediate format - parse it in some application-specific (and friendly) object structure, and stash that in a singleton. You don't want to go hunting through a DOM for every web request. Then, regularly (every x minutes or y web requests), check whether the file has been updated, re-parse it, and update your singleton.
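A minimal sketch of that idea (the AppConfig bean, its fromDocument factory, the file path and the refresh interval are all made up for illustration):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical holder for the parsed configuration: parse once, re-check periodically.
public final class ConfigHolder {

    private static final long REFRESH_INTERVAL_MS = 60_000; // assumption: re-check every minute
    private static final File CONFIG_FILE = new File("/path/to/config.xml"); // placeholder path

    private static volatile AppConfig config;     // your application-specific config object
    private static volatile long lastChecked;
    private static volatile long lastModified;

    private ConfigHolder() { }

    public static AppConfig getConfig() throws Exception {
        long now = System.currentTimeMillis();
        if (config == null || now - lastChecked > REFRESH_INTERVAL_MS) {
            synchronized (ConfigHolder.class) {
                if (config == null || CONFIG_FILE.lastModified() != lastModified) {
                    Document doc = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder()
                            .parse(CONFIG_FILE);
                    config = AppConfig.fromDocument(doc); // map the DOM into your own beans (hypothetical)
                    lastModified = CONFIG_FILE.lastModified();
                }
                lastChecked = now;
            }
        }
        return config;
    }
}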
The reading of the file should be implemented separately, i.e. not inside the code that handles requests (or perhaps in a static initialization block), and then you can use a file watcher to detect file changes (see the sketch after this list). Options for file watching:
File Watcher
WatchService API(Java 7)
JFileNotify
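A minimal WatchService (Java 7+) sketch; the directory and file name below are placeholders:

import java.nio.file.*;

public class ConfigWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/path/to/config/dir");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();               // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = (Path) event.context();   // file name relative to the watched dir
                if (changed.toString().equals("config.xml")) {
                    // re-parse the file and swap in the new configuration here
                }
            }
            key.reset();                                 // must reset the key to keep watching
        }
    }
}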
You can keep the DOM in application memory just like any other data - the details depend on what application server / framework you are using. But DOM is a poor choice, not just because of its clumsy API, but also because DOM is not thread-safe, so all access would need to be synchronized. You're better off with a tree model that is read-only once parsed. Consider using Saxon and XPath/XQuery for this - load the tree once into a read only Saxon tree that can then be repeatedly accessed using XPath or XQuery, invoked from your Java application.
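A rough sketch of that approach with Saxon's s9api (assuming Saxon-HE is on the classpath; the file path and XPath expression are placeholders, not part of the answer):

import java.io.File;
import net.sf.saxon.s9api.*;

// Build the read-only tree once (e.g. at startup) and keep the XdmNode around;
// it can then be queried repeatedly, and from multiple threads, without re-parsing.
Processor processor = new Processor(false);
XdmNode configTree = processor.newDocumentBuilder().build(new File("/path/to/config.xml"));

// Cheap, repeated queries against the in-memory tree:
XPathCompiler xpath = processor.newXPathCompiler();
XdmValue result = xpath.evaluate("/config/redirect[@from='/old']/@to", configTree);
System.out.println(result);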
Creating Java classes to represent your configuration data more explicitly, as suggested by cdegroot, is an alternative, but not really necessary in my view. It will probably involve more work for you each time you add something to the configuration file.
Related
Reading the docs, this is the method used in all the examples I've seen:
(Version of org.jdom.input.SAXBuilder is jdom-1.1.jar)
Document doc = new SAXBuilder().build(is);
Element root = doc.getRootElement();
Element child = root.getChild("someChildElement");
...
where "is" is an InputStream variable.
I'm wondering, since this is a SAX builder (as opposed to a DOM builder), does the entire inputstream get read into the document object with the build method? Or is it working off a lazy load and as long as I request elements with Element.getChildren() or similar functions (stemming from the root node) that are forward-only through the document, then the builder automatically takes care of loading chunks of the stream for me?
I need to be sure I'm not loading the whole file into memory.
Thanks,
Mike
The DOM parser, like the JDOM parser, loads the whole XML resource into memory to provide you with a Document instance that allows navigating the elements of the XML.
Some references here:
the DOM standard is a codified standard for an in-memory document model.
And here:
JDOM works on the logical in-memory XML tree,
Both DOM and JDOM use the SAX parser internally to read the XML resource, but they use it only to store the whole content in the Document instance that they return. Indeed, with DOM and JDOM, the client never needs to provide a handler to intercept events triggered by the SAX parser.
Note that neither DOM nor JDOM has any obligation to use SAX internally.
They use it mainly because the SAX standard is already there, and so it makes sense to use it for reporting errors.
I need to be sure I'm not loading the whole file into memory.
You have two programming models to work with XML: streaming and the document object model (DOM).
You are looking for the first one.
So use the SAX parser by providing your own handler for the events it generates (startDocument(), startElement(), and so on; a minimal handler sketch follows the quote below), or as an alternative look at a more user-friendly API: StAX (Streaming API for XML):
As an API in the JAXP family, StAX can be compared, among other APIs,
to SAX, TrAX, and JDOM. Of the latter two, StAX is not as powerful or
flexible as TrAX or JDOM, but neither does it require as much memory
or processor load to be useful, and StAX can, in many cases,
outperform the DOM-based APIs. The same arguments outlined above,
weighing the cost/benefits of the DOM model versus the streaming
model, apply here.
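If you go the SAX route, a minimal handler along the lines described above might look like this (the element name is a placeholder, and is refers to the InputStream from the question):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Minimal SAX sketch: nothing stays in memory except what you collect yourself.
DefaultHandler handler = new DefaultHandler() {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0);                       // reset the buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("someChildElement")) {  // placeholder element name
            System.out.println(text.toString()); // use the collected text here
        }
    }
};

SAXParserFactory.newInstance().newSAXParser().parse(is, handler);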
It eagerly parses the whole file to build the in-memory representation (i.e. Document) of the XML file.
If you want to be absolutely certain of that, you can go through the source on GitHub, in particular the following classes: SAXBuilder, SAXHandler, and Document.
I have an XML document and I need to make it searchable via a webapp. The document is currently only 6 MB, but it could be extremely large, so from my research SAX seems the way to go.
So my question is, given a search term, do I:
Load the document into memory once (into a list of beans and then store it in memory) and then search it when need be?
or
Parse the document looking for the desired search term, only add the matches to the list of beans, and repeat this process with each search?
I am not that experienced with webapps, but I am trying to figure out the optimal way to approach this. Does anyone with Tomcat, SAX and Java webapp experience have any suggestions as to which would be optimal?
Regards,
Nate
When you say that your XML file could be very large, I assume you do not want to keep it in memory. If you want it to be searchable, I understand that you want indexed access, without a full read each time. IMHO, the only way to achieve that is to parse the file and load the data into a lightweight file database (Derby, HSQL or H2) and add relevant indexes to the database. Databases allow indexed search on off-memory data; XML files do not.
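As an illustration of that idea (not part of the original answer), the parsed values could be pushed into an embedded H2 database roughly like this; the JDBC URL, table layout and sample values are made up, and the H2 driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Embedded H2 database in a local file; searches then hit an index instead of the XML.
Connection con = DriverManager.getConnection("jdbc:h2:./xmlindex");
con.createStatement().execute(
        "CREATE TABLE IF NOT EXISTS entry (id INT AUTO_INCREMENT PRIMARY KEY, tag VARCHAR(100), content CLOB)");
con.createStatement().execute("CREATE INDEX IF NOT EXISTS idx_tag ON entry(tag)");

// Inside your SAX/StAX handler, insert each element of interest as you stream past it.
PreparedStatement insert = con.prepareStatement("INSERT INTO entry(tag, content) VALUES (?, ?)");
insert.setString(1, "y");             // placeholder tag name
insert.setString(2, "search text1");  // placeholder content
insert.executeUpdate();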
Assuming your search field is a field that is known to you - for example, let the structure of the XML be:
<a>....</a>
<x>
    <y>search text1</y>
    <z>search text2</z>
</x>
<b>...</b>
and say the search has to be made on 'x' and its children; you can achieve this using a StAX parser and JAXB.
To understand the difference between STAX and SAX, please refer:
When should I choose SAX over StAX?
Using these APIs you will avoid storing the entire document in memory. Using a StAX parser, you parse the document, and when you encounter the 'x' tag you load it into memory (Java beans) using JAXB.
Note: Only x and its children will be loaded into memory, not the entire document parsed so far.
Do not use any approaches that use DOM parsers.
Sample code to load only the part of the document where the search field is present:
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stream.StreamSource;

XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("file");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);
xsr.nextTag();
// Skip forward until the cursor reaches the <x> element.
while (!xsr.getLocalName().equals("x")) {
    xsr.nextTag();
}
// Unmarshal only the <x> subtree into the bean; the rest of the document is never materialized.
JAXBContext jc = JAXBContext.newInstance(X.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
JAXBElement<X> jb = unmarshaller.unmarshal(xsr, X.class);
xsr.close();
X x = jb.getValue();
System.out.println(x.y.content);
Now you have the field content, so you can return the appropriate field. When the user again searches for the same field under 'x', serve the results from memory and avoid parsing the XML again.
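That in-memory reuse can be as simple as a map keyed by the element name; a sketch (parseElement is a hypothetical helper that wraps the StAX/JAXB code above and handles its checked exceptions):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Parse on the first lookup, serve every later search for the same element from memory.
ConcurrentMap<String, X> cache = new ConcurrentHashMap<>();
X cached = cache.computeIfAbsent("x", tag -> parseElement(tag)); // parseElement is hypothetical
System.out.println(cached.y.content);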
Searching the file using XPath or XQuery is likely to be very fast (quite fast enough unless you are talking thousands of transactions per second). What takes time is parsing the file - building a tree in memory so that XPath or XQuery can search it. So (as others have said) a lot depends on how frequently the contents of the file change. If changes are infrequent, you should be able to keep a copy of the file in shared memory, so the parsing cost is amortized over many searches. But if changes are frequent, things get more complicated. You could try keeping a copy of the raw XML on disk, and a copy of the parsed XML in memory, and keeping the two in sync. Or you could bite the bullet and move to using an XML database - the initial effort will pay off in the end.
Your comment that "SAX is the way to go" would only be true if you want to parse the file each time you search it. If you're doing that, then you want the fastest possible way to parse the file. But a much better way forward is to avoid parsing it afresh on each search.
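To make the "parse once, search many times" point concrete, here is a small sketch with the standard javax.xml.xpath API (the file name and expression are placeholders):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Pay the parsing cost once...
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new File("data.xml"));

// ...then run as many searches as you like against the in-memory tree.
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList hits = (NodeList) xpath.evaluate("//x[y='search text1']", doc, XPathConstants.NODESET);
System.out.println(hits.getLength() + " match(es)");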
I need to parse relatively big XML files on Android.
Some nodes' internal structure contains HTML tags, and for some other nodes I need to pull content from different depth levels. Therefore, instead of using XmlPullParser I plan to:
using XPath, find the proper node
using 'getElementsByTagName' find appropriate sub-node(s)
extract information and save it in my custom data objects.
The problem I have is performance. The way I open the file is the following:
File file = new File(_path);
FileInputStream is = new FileInputStream(file);
XPath xPath = XPathFactory.newInstance().newXPath();
NamespaceContext context = new NamespaceContextMap("def", __URL__);
xPath.setNamespaceContext(context);
Object objs = xPath.evaluate("/def:ROOT_ELEMENT/*",
        new InputSource(is), XPathConstants.NODESET);
Even though I only need to get a few strings that are at the very beginning of the XML file, it looks like XPath parses the WHOLE XML file and puts it into a DOM structure.
In some cases I need access to the full object, and it is OK for the operation to take a few seconds on a few-megabyte file.
In other cases I only need to get a few nodes and don't want users to wait for my program to perform redundant parsing.
Q1: What is the way to get some parts of an XML file without parsing it in full?
Q2: Is there any way to restrict XPath from scanning/parsing the WHOLE XML file? For instance: scan only to the 2nd level of depth?
Thank you.
P.S. In one particular case the XML file represents the FB2 file format; if you have any specific tips that could solve my problem for FB2 file parsing, please feel free to add additional comments.
I don't know too much about the XML toolset available for Android, except to know that it's painfully limited!
Probably the best way to tackle this requirement is to write a streaming SAX filter that looks for the parts of the document you are interested in, and builds a DOM containing only those parts, which you can then query using XPath. I'm a bit reluctant to advise that, because it won't be easy if you haven't done such things before, but it seems the right approach.
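A rough, untested sketch of such a filter: it streams the file with SAX and copies only the elements of interest (here the placeholder name "section") into a small DOM that can then be queried with XPath.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// The partial DOM receives only the subtrees we explicitly copy during the SAX pass.
Document partial = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element root = partial.createElement("extract");
partial.appendChild(root);

DefaultHandler filter = new DefaultHandler() {
    private Node current;   // where new nodes get attached; null means we are skipping

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (current == null && qName.equals("section")) {   // placeholder element of interest
            current = root;                                  // start copying here
        }
        if (current != null) {
            Element e = partial.createElement(qName);
            for (int i = 0; i < attrs.getLength(); i++) {
                e.setAttribute(attrs.getQName(i), attrs.getValue(i));
            }
            current.appendChild(e);
            current = e;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(partial.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null) {
            // Stop copying once the copied <section> itself is closed.
            current = (current.getParentNode() == root && qName.equals("section"))
                    ? null : current.getParentNode();
        }
    }
};

SAXParserFactory.newInstance().newSAXParser().parse(new File(_path), filter);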
I've been looking at loading XML files with Java and I just can't seem to decipher a certain part of it.
I understand that SAX is a streaming mechanism, but when talking about DOM, various sites talk about the model "loading in the full file" or "loading in all the tags", supported by the recommendation to use SAX with large XML files.
To what degree does DOM actually load the full file? The second I access the root node, does it allocate program memory for every single byte of the file? Does it only load tags until the lowest level when it loads text contents?
I'm going to be working with large files, but random access would be useful and editing is a requirement, so I believe DOM is the best choice for me.
Thanks a lot.
It does load the entire file and constructs a tree structure in memory. Thus every single tag, attribute and nested tag (no matter how many levels of nesting) will be loaded. It is just that the constructed tree grows bigger the larger the XML file is.
Yes, DOM reads the whole document, parses it, and places it in memory.
If you're parsing using DOM, you do something similar to this:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(file);
(inside a try/catch)
The moment the parse is executed, the Document doc variable will contain the entire document represented as a DOM hierarchy.
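From there the whole tree is available for random access, purely in memory, for example:

import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Everything below works on the in-memory tree; the file is not touched again.
Element root = doc.getDocumentElement();
NodeList items = root.getElementsByTagName("item");   // placeholder element name
for (int i = 0; i < items.getLength(); i++) {
    System.out.println(items.item(i).getTextContent());
}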
I need to parse an XML file using Java and create a bean out of that XML file after parsing.
I need this while using Spring JMS, in which the producer is producing an XML file. First I need to read the XML file and take action accordingly.
I read something about parsing and came up with these options:
xpath
DOM
Which will be the best option to parse the XML file?
Did you check JAXB?
There are three ways of parsing an XML file: SAX, DOM and StAX.
DOM will parse the whole file and build up a tree in memory - great for small files, but obviously if the file is huge you don't want the entire tree just sitting in memory! SAX is event based - it doesn't load anything into memory per se, it just fires off a series of events as it reads through the file. StAX is a median between the two: the application moves the cursor forward as it needs, grabbing the data as it goes (so no event firing or huge memory consumption).
Which one you use will really depend on your application - all have built-in libraries since Java 6.
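For the StAX option, the cursor style looks roughly like this (the file name and element name are placeholders):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// The application pulls events forward one at a time; only the current event is in memory.
XMLStreamReader reader = XMLInputFactory.newFactory()
        .createXMLStreamReader(new FileInputStream("message.xml"));
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT
            && reader.getLocalName().equals("order")) {    // placeholder element
        System.out.println(reader.getElementText());       // grab the data and move on
    }
}
reader.close();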
It looks like you receive a serialized object via Java messaging. Have a look first at how the object is being serialized. Usually this is done with a library (JAXB, Axis, ...) and you could use the very same library to create a deserializer.
You will need:
The XML schema (an xsd file)
The Java bean class (very helpful, it should exist)
Then, usually the library will create all helper classes and files and you don't have to care about parsing.
If you need to create an object, just extract the needed properties and go on...
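A minimal JAXB sketch along those lines (the OrderBean class, its fields and the file name are made up for illustration):

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical bean matching the XML the producer sends.
@XmlRootElement(name = "order")
class OrderBean {
    public String id;
    public String customer;
}

// Deserialize the received XML straight into the bean; no hand-written parsing.
Unmarshaller unmarshaller = JAXBContext.newInstance(OrderBean.class).createUnmarshaller();
OrderBean order = (OrderBean) unmarshaller.unmarshal(new File("order.xml"));
System.out.println(order.id);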
I recommend using StAX; see this tutorial for more information.
Umm... there are several ways you can parse an XML document into memory and work with it. You mentioned DOM. DOM actually loads the whole document into memory and then allows you to move between different branches of the XML document.
On the other hand, you could use StAX. It works similarly to DOM, the difference being that it streams the content of the XML document, allowing better use of memory; however, it does not retain the information that has already been read.
Look at http://download.oracle.com/javaee/5/tutorial/doc/bnbem.html - it gives details about both parsing methods and example code. Hope that helps.