Here is what I am trying to do.
Read an entire XML file (I do not care about the names of elements or attributes, etc.).
Save the read XML file into memory.
Update some values of the read XML file.
Write it back to an XML file.
I am trying to use XMLStreamReader to read an XML file, but in all the examples I have seen so far it looks like I have to provide an element name. I do not care about element names; I just want to read the entire XML file into memory. I am also not sure what datatype I should use to store the content as I read it; I am thinking of storing it in a Document.
Any suggestions on how to read an entire XML file and store the read contents in memory?
Thanks.
The easiest way to do this would be with JAXB. You can use xjc to generate Java classes from your XML schema. Then use JAXB to unmarshal (load) your data, manipulate the Java object just as you normally would any other object (using getters/setters), and marshal (save) it back to an XML file.
You could also use DOM directly, but manipulating the DOM is a lot more tedious than working with POJOs that directly mirror your XML structure.
Sometimes JAXB could be overkill, and DOM is ok for a few manipulations:
DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = f.newDocumentBuilder();
Document doc = builder.parse(new File("test.xml"));
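For the full read-update-write cycle the question asks about, here is a minimal sketch using only the JDK's DOM and Transformer APIs. The element names (root, item) are made up for illustration, and it parses from a string so the example is self-contained:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomRoundTrip {
    public static String updateFirstChild(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Parse the whole document into memory; no element names needed up front
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Update a value: here we change the text of the first descendant element
        Element root = doc.getDocumentElement();
        Element first = (Element) root.getElementsByTagName("*").item(0);
        first.setTextContent("updated");

        // Serialize the modified tree back to XML
        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(updateFirstChild("<root><item>old</item></root>"));
    }
}
```

In a real program you would parse from and write back to a File, but the structure of the round trip is the same.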
XMLStreamReader is an event-oriented parser that gives you a stream-like view of the XML. You apparently want the entire XML tree in memory, so the best option would be JDOM:
public Document parse(Reader in) throws JDOMException, IOException {
    return new SAXBuilder().build(in);
}
The advantage of JDOM over JAXB is navigation: you can fetch an element deep in the XML tree with a simple expression like "//my-elem". With JAXB you would have to write a barnful of code and nested loops to get the same simple result.
(Not to mention that without the XML Schema of your XML, JAXB won't even talk to you.)
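For comparison, the JDK's built-in XPath engine gives the same one-expression navigation over a plain DOM; JDOM's own XPath API has a similar shape. A minimal sketch (the element name my-elem is taken from the expression above, and the XML is illustrative):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathFetch {
    public static String fetch(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // "//my-elem" finds the element anywhere in the tree, no nested loops
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><a><b><my-elem>found</my-elem></b></a></root>";
        System.out.println(fetch(xml, "//my-elem")); // prints "found"
    }
}
```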
Reading the docs, this is the method used in all the examples I've seen:
(Version of org.jdom.input.SAXBuilder is jdom-1.1.jar)
Document doc = new SAXBuilder().build(is);
Element root = doc.getRootElement();
Element child = root.getChild("someChildElement");
...
where is is an InputStream variable.
I'm wondering, since this is a SAX builder (as opposed to a DOM builder), does the entire inputstream get read into the document object with the build method? Or is it working off a lazy load and as long as I request elements with Element.getChildren() or similar functions (stemming from the root node) that are forward-only through the document, then the builder automatically takes care of loading chunks of the stream for me?
I need to be sure I'm not loading the whole file into memory.
Thanks,
Mike
The DOM parser, similarly to the JDOM parser, loads the whole XML resource into memory to provide a Document instance that lets you navigate the elements of the XML.
Some references here:
"the DOM standard is a codified standard for an in-memory document model."
And here:
"JDOM works on the logical in-memory XML tree."
Both DOM and JDOM use a SAX parser internally to read the XML resource, but they use it only to store the whole content in the Document instance they return. Indeed, with DOM and JDOM, the client never needs to provide a handler to intercept the events triggered by the SAX parser.
Note that neither DOM nor JDOM is under any obligation to use SAX internally.
They use it mainly because the SAX standard is already there, and so it also makes sense to use it for reporting errors.
I need to be sure I'm not loading the whole file into memory.
You have two programming models to work with XML: streaming and the document object model (DOM).
You are looking for the first one.
So use a SAX parser by providing your own handler to handle the events generated by the parser (startDocument(), startElement(), and so on), or, as an alternative, look at a more user-friendly API: StAX (Streaming API for XML):
As an API in the JAXP family, StAX can be compared, among other APIs,
to SAX, TrAX, and JDOM. Of the latter two, StAX is not as powerful or
flexible as TrAX or JDOM, but neither does it require as much memory
or processor load to be useful, and StAX can, in many cases,
outperform the DOM-based APIs. The same arguments outlined above,
weighing the cost/benefits of the DOM model versus the streaming
model, apply here.
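A minimal StAX sketch of that streaming model: the reader pulls one event at a time and never builds a tree, so memory use stays flat regardless of file size. Here it just counts start-element events over an illustrative document:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxCount {
    // Pull events one at a time; only the current event is held in memory
    public static int countElements(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        int elements = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                elements++;
            }
        }
        reader.close();
        return elements;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><c>text</c></a>")); // prints 3
    }
}
```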
It eagerly parses the whole file to build the in-memory representation (i.e. Document) of the XML file.
If you want to be absolutely certain of that, you can go through the source on GitHub, in particular the following classes: SAXBuilder, SAXHandler, and Document.
I have an XML document and I need to make it searchable via a webapp. The document is currently only 6 MB, but it could become extremely large, so from my research SAX seems the way to go.
So my question is, given a search term, do I:
Load the document into memory once (into a list of beans that then stays in memory) and search that list when needed?
or
Parse the document looking for the desired search term, add only the matches to the list of beans, and repeat this process for each search?
I am not very experienced with webapps, but I am trying to figure out the optimal way to approach this. Does anyone with Tomcat, SAX, and Java webapp experience have suggestions as to which would be optimal?
Regards,
Nate
When you say that your XML file could be very large, I assume you do not want to keep it in memory. If you want it to be searchable, I understand that you want indexed access, without a full read each time. IMHO, the only way to achieve that is to parse the file and load the data into a lightweight file database (Derby, HSQL, or H2) and add the relevant indexes to the database. Databases allow indexed search over off-memory data; XML files do not.
Assuming your search field is known to you, for example, let the structure of the XML be:
<root>
  <a>....</a>
  <x>
    <y>search text1</y>
    <z>search text2</z>
  </x>
  <b>...</b>
</root>
and say the search has to be made on 'x' and its children; you can achieve this using a StAX parser and JAXB.
To understand the difference between STAX and SAX, please refer:
When should I choose SAX over StAX?
Using these APIs you will avoid storing the entire document in memory. With a StAX parser you walk through the document, and when you encounter the 'x' tag you load just that part into memory (Java beans) using JAXB.
Note: Only x and its children will be loaded to memory, not the entire document parsed till now.
Do not use any approaches that use DOM parsers.
Sample code to load only the part of the document where the search field is present.
XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("file");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);

// advance the cursor to the <x> element
xsr.nextTag();
while (!xsr.getLocalName().equals("x")) {
    xsr.nextTag();
}

// unmarshal only the <x> subtree into a bean; the rest is never materialized
JAXBContext jc = JAXBContext.newInstance(X.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
JAXBElement<X> jb = unmarshaller.unmarshal(xsr, X.class);
xsr.close();

X x = jb.getValue();
System.out.println(x.y.content);
Now you have the field content to return the appropriate result. When the user again searches for a field under 'x', serve the results from memory and avoid parsing the XML again.
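If JAXB is not available (or you don't want to maintain bean classes), the same skip-to-the-element idea works with StAX alone. This sketch assumes the x/y structure shown above and parses from a string so it is self-contained:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class StaxSkipTo {
    // Advance to the first <y> and return its text,
    // without materializing the rest of the document
    public static String readY(String xml) throws Exception {
        XMLStreamReader xsr = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        String result = null;
        while (xsr.hasNext()) {
            if (xsr.next() == XMLStreamReader.START_ELEMENT
                    && xsr.getLocalName().equals("y")) {
                result = xsr.getElementText(); // consumes events up to </y>
                break;
            }
        }
        xsr.close();
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><a>skip</a><x><y>search text1</y>"
                   + "<z>search text2</z></x><b>rest</b></root>";
        System.out.println(readY(xml)); // prints "search text1"
    }
}
```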
Searching the file using XPath or XQuery is likely to be very fast (quite fast enough unless you are talking thousands of transactions per second). What takes time is parsing the file - building a tree in memory so that XPath or XQuery can search it. So (as others have said) a lot depends on how frequently the contents of the file change. If changes are infrequent, you should be able to keep a copy of the file in shared memory, so the parsing cost is amortized over many searches. But if changes are frequent, things get more complicated. You could try keeping a copy of the raw XML on disk, and a copy of the parsed XML in memory, and keeping the two in sync. Or you could bite the bullet and move to using an XML database - the initial effort will pay off in the end.
Your comment that "SAX is the way to go" would only be true if you want to parse the file each time you search it. If you're doing that, then you want the fastest possible way to parse the file. But a much better way forward is to avoid parsing it afresh on each search.
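A sketch of that amortization: parse once, keep the Document, and run each search as an XPath query against the cached tree. The XML and expressions here are illustrative:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CachedSearch {
    private final Document doc;   // parsed once, kept in memory
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    public CachedSearch(String xml) throws Exception {
        // The expensive step: done a single time, not per search
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public String search(String expression) throws Exception {
        // Each search is a cheap walk of the in-memory tree
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        CachedSearch s = new CachedSearch(
                "<books><book id='1'>Java</book><book id='2'>XML</book></books>");
        System.out.println(s.search("//book[@id='2']")); // prints "XML"
        System.out.println(s.search("count(//book)"));   // prints "2"
    }
}
```

Note that javax.xml.xpath objects are not thread-safe, so in a real webapp you would synchronize access or keep one XPath instance per thread.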
I need to parse relatively big XML files on Android.
Some nodes' internal structure contains HTML tags, and for some other nodes I need to pull content from different depth levels. Therefore, instead of using XmlPullParser, I plan to:
using XPath, find the proper node
using 'getElementsByTagName' find appropriate sub-node(s)
extract information and save it in my custom data objects.
The problem I have is performance. The way how I open file is following:
File file = new File(_path);
FileInputStream is = new FileInputStream(file);
XPath xPath = XPathFactory.newInstance().newXPath();
NamespaceContext context = new NamespaceContextMap("def", __URL__);
xPath.setNamespaceContext(context);
Object objs = xPath.evaluate("/def:ROOT_ELEMENT/*",
        new InputSource(is), XPathConstants.NODESET);
Even though I only need a few strings from the very beginning of the XML file, it looks like XPath parses the WHOLE XML file and puts it into a DOM structure.
In some cases I need access to the full document, and it is OK for the operation to take a few seconds on a few-megabyte file.
In other cases I only need to get a few nodes and don't want users to wait while my program performs redundant parsing.
Q1: What is the way to get some parts of XML file without parsing it in full?
Q2: Is there any way to restrict XPath from scanning/parsing WHOLE XML file? For instance: scan till 2nd level of depth?
Thank you.
P.S. In one particular case, the XML file is in the FB2 file format, so if you have any specific tips that could solve my problem for fb2 parsing, please feel free to add additional comments.
I don't know too much about the XML toolset available for android, except to know that it's painfully limited!
Probably the best way to tackle this requirement is to write a streaming SAX filter that looks for the parts of the document you are interested in, and builds a DOM containing only those parts, which you can then query using XPath. I'm a bit reluctant to advise that, because it won't be easy if you haven't done such things before, but it seems the right approach.
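A rough sketch of such a SAX filter, assuming a single matching subtree whose name you know (here called "target", an illustrative name). It copies only that subtree's elements, attributes, and text into a fresh, small DOM, which XPath can then query:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PartialDomFilter extends DefaultHandler {
    private final String wanted;   // name of the subtree to keep
    private final Document doc;    // small DOM holding only the kept part
    private Node current;          // null while outside the wanted subtree
    private int depth;             // nesting depth inside the wanted subtree

    public PartialDomFilter(String wanted) throws Exception {
        this.wanted = wanted;
        this.doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        // Assumes a single matching subtree; a real filter would collect
        // multiple matches under a synthetic root element
        if (current == null && qName.equals(wanted)) {
            current = doc;          // start copying at the document node
        }
        if (current != null) {
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(e);
            current = e;
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null && current != doc) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current != null) {
            current = current.getParentNode();
            if (--depth == 0) {
                current = null;     // left the wanted subtree: stop copying
            }
        }
    }

    public static Document filter(String xml, String wanted) throws Exception {
        PartialDomFilter h = new PartialDomFilter(wanted);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
        return h.doc;
    }
}
```

The parse is still a single streaming pass over the whole file, but only the selected subtree ever occupies memory as a tree.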
I have an XML document stored in a database table. I need to get the XML, modify a few elements, and put the XML back in the database.
I am thinking of using JDOM or JAXB to modify the XML elements. Could you please suggest which one is better in terms of performance?
Thanks!
JAXB and JDOM are completely different things. JAXB serializes Java objects into an XML format and vice versa. JDOM simply reads in the XML file and stores it in a DOM tree, which can then be used to modify the XML itself. So you are better off with JDOM.
JAXB is to be used when you have objects whose attribute values are stored in XML: you can parse an XML document, it gives you Java objects, and you can then write those back.
That is quite a bit of work if you simply want to change some values, and it doesn't work with arbitrary XML files; JAXB has its own format linked to your objects' definitions.
JDOM also creates objects, but the objects used are XML objects like Element, NodeList, ...
If you just want to change some values, why not read the XML file as a plain text file and use string operations to make your changes?
Or, if the modification is defined more logically, use XSLT and a stylesheet translator.
Googling for XSLT and Java will give you tons of examples.
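For example, a value change can be done with the classic identity transform plus one overriding template, run with the JDK's built-in XSLT processor. The element name (status) and replacement value are illustrative:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltUpdate {
    // Identity transform plus one override: replace the text of <status>
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' "
      + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "  <xsl:template match='@*|node()'>"
      + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "  </xsl:template>"
      + "  <xsl:template match='status/text()'>done</xsl:template>"
      + "</xsl:stylesheet>";

    public static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<job><status>pending</status></job>"));
    }
}
```

Everything except the one overridden element is copied through unchanged, which is exactly the "modify a few values" use case.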
I've been looking at loading XML files with Java and I just can't seem to decipher a certain part of it.
I understand that SAX is a streaming mechanism, but when talking about DOM, various sites describe the model as "loading in the full file" or "loading in all the tags", supported by the recommendation to use SAX with large XML files.
To what degree does DOM actually load the full file? The second I access the root node, does it allocate program memory for every single byte of the file? Does it only load tags until the lowest level when it loads text contents?
I'm going to be working with large files, but random access would be useful and editing is a requirement, so I believe DOM is the best choice for me.
Thanks a lot.
It does load the entire file and constructs a tree structure in memory. Thus, every single tag, attribute and any nested tags (no matter how many levels of nesting) will be loaded. It is just that the constructed tree grows bigger the larger the XML file you have.
Yes, DOM reads the whole document, parses it, and places it in memory.
If you're parsing using DOM, you do something similar to this:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(file);
(inside a try/catch)
The moment the parse is executed, the Document doc variable will contain the entire document represented as a DOM hierarchy.
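One way to convince yourself the whole tree is materialized: immediately after parse() returns, every element at every depth is already reachable by a plain recursive walk, with no further reads from the file. A small sketch over an illustrative document:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomFullyLoaded {
    // Every element, however deeply nested, is reachable right after parse()
    public static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countElements(c);
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                        "<a><b><c/></b><d>text</d></a>")));
        System.out.println(countElements(doc)); // prints 4
    }
}
```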