I need to parse relatively big XML files on Android.
Some node internal structure contains HTML tags, for some other nodes I need to pull content from different depth levels. Therefore, instead of using XmlPullParser I plan to:
using XPath, find the proper node
using 'getElementsByTagName' find appropriate sub-node(s)
extract information and save it in my custom data objects.
The problem I have is performance. The way how I open file is following:
File file = new File(_path);
FileInputStream is = new FileInputStream(file);
XPath xPath = XPathFactory.newInstance().newXPath();
NamespaceContext context = new NamespaceContextMap("def", __URL__);
xPath.setNamespaceContext(context);
Object objs = xPath.evaluate("/def:ROOT_ELEMENT/*,
new InputSource(is), XPathConstants.NODESET);
Even though I need to get few strings that are in the very beginning of the XML file, it looks like XPath parses WHOLE xml file and put it in DOM structure.
In some cases I need access to full object and it is ok to have operation running few seconds for few megabyte file.
In other cases - I only need to get few nodes and don't want users to wait for my program to perform a redundant parsing.
Q1: What is the way to get some parts of XML file without parsing it in full?
Q2: Is there any way to restrict XPath from scanning/parsing WHOLE XML file? For instance: scan till 2nd level of depth?
Thank you.
P.S. In one particular case, XML file represents FB2 file format and if you have any specific tips that could resolve my problem for fb2-files parsing, please fill free to add additional comments.
I don't know too much about the XML toolset available for android, except to know that it's painfully limited!
Probably the best way to tackle this requirement is to write a streaming SAX filter that looks for the parts of the document you are interested in, and builds a DOM containing only those parts, which you can then query using XPath. I'm a bit reluctant to advise that, because it won't be easy if you haven't done such things before, but it seems the right approach.
Related
I have an XML document and I need to make it searchable via a webapp. The document is currently only 6mb.. but could be extrememly large, thus from my research SAX seems the way to go.
So my question is, given a search term do I:
Do I load the document in memory once (into a list of beans and then
store it in memory)? And then search it when need be?
or
Parse the document looking for the desired search term and only add
the matches to the list of beans? And repeat this process with each
search?
I am not that experienced with webapps, but I am trying to figure out the optimal way to approach this, does anyone with Tomcat, SAX and Java Web apps have any suggestions as to which would be optimum?
Regards,
Nate
When you say that your XML file could be very large, I assume you do not want to keep it in memory. If you want it to be searchable, I understand that you want indexed accesses, without a full read at each time. IMHO, the only way to achieve that is to parse the file and load the data in a lightweight file database (Derby, HSQL or H2) and add relevant indexes to the database. Databases do allow indexed search on off memory data, XML files do not.
Assuming your search field is a field that is known to you, for example let the structure of the xml be:
<a>....</a>
<x>
<y>search text1</y>
<z>search text2</z>
</x>
<b>...</b>
and say the search has to be made on the 'x' and its children, you can achieve this using STAX parser and JAXB.
To understand the difference between STAX and SAX, please refer:
When should I choose SAX over StAX?
Using these APIs you will avoid storing the entire document in the memory. Using STAX parser, you parse the document, when you encounter the 'x' tag load it into memory(java beans) using JAXB.
Note: Only x and its children will be loaded to memory, not the entire document parsed till now.
Do not use any approaches that use DOM parsers.
Sample code to load only the part of the document where the search field is present.
XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("file");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);
xsr.nextTag();
while(!xsr.getLocalName().equals("x")) {
xsr.nextTag();
}
JAXBContext jc = JAXBContext.newInstance(X.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
JAXBElement<Customer> jb = unmarshaller.unmarshal(xsr, X.class);
xsr.close();
X x = jb.getValue();
System.out.println(x.y.content);
Now you have the field content to return the appropriate field. When the user again searches for the same field under 'x', give the results from the memory and avoid parsing the XML again.
Searching the file using XPath or XQuery is likely to be very fast (quite fast enough unless you are talking thousands of transactions per second). What takes time is parsing the file - building a tree in memory so that XPath or XQuery can search it. So (as others have said) a lot depends on how frequently the contents of the file change. If changes are infrequent, you should be able to keep a copy of the file in shared memory, so the parsing cost is amortized over many searches. But if changes are frequent, things get more complicated. You could try keeping a copy of the raw XML on disk, and a copy of the parsed XML in memory, and keeping the two in sync. Or you could bite the bullet and move to using an XML database - the initial effort will pay off in the end.
Your comment that "SAX is the way to go" would only be true if you want to parse the file each time you search it. If you're doing that, then you want the fastest possible way to parse the file. But a much better way forward is to avoid parsing it afresh on each search.
From some of our other application i am getting XML file.
I want to read that XML file node by node and store node values in database for further use.
So, what is the best way/API to read XML file and retrieve node values using Java?
There are various tools for that. Today, I prefer two:
Simple XML
JAXB
StAX
Here is a good comparison between the Simple and JAXB: http://blog.bdoughan.com/2010/10/how-does-jaxb-compare-to-simple.html
Personally, I like Simple a bit better because support by Niall is excellent but JAXB (as explained in the blog post above) can produce better output with less code.
StAX is a more basic API which allows you to read XML documents that simply don't fit into RAM (neither Simple nor JAXB allow you to read an XML document "object by object" - they will always try to load everything into RAM at once).
I would advice for a simple XML tool if you can manage by that.
For example I and my colleges have introduces complex XML frameworks that worked like a charm at first.
Then you forget about the framework, you have special build files just for mapping XML to beans, you have annotated beans, you provide a new barrier for new developers to your project. You loose much of your freedom to refactor.
At the end you will be sorry that you used the complex framework to save some time in the beginning and I have seen more than one time that the frameworks have been thrown out in refactoring because everybody had a negative feeling about it although they are great at paper.
So think twice about introducing complex XML frameworks if you seldom use them. If you and your team use them rather frequently then they are the way to go.
I suggest using XPath. Xalan is already included in the JDK (no external jars needed) and it fits your requirement, i.e. iterating through element nodes (i presume) and storing their text values. For example:
String xml = "<root> <item>One</item> <item>Two</item> <item>Three</item> </root>";
XPathFactory xpf = XPathFactory.newInstance();
InputSource is = new InputSource(new StringReader(xml));
NodeList nodes = (NodeList) xpf.newXPath().evaluate("/*/*", is,
XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
Element e = (Element) nodes.item(i);
System.out.println(e.getNodeName() + " -> " + e.getTextContent());
}
}
This example returns a list of all non-root elements and print out the corresponding element name and text content. Adapt the xpath expression to fit your needs.
dom4j and jdom are pretty easy to use (ignoring the requirement "best" for a moment ;) )
Try Apache Xerces. It is mature and robust. Any such available alternatives will do also, just be sure not to roll out your own implementation.
Bypassing alltogether the question of parsing the xml and storing the values in a database, I'd like to question the need to do the above. Most databases can handle xml nowadays, so it can be stored in some way into a table without the need of parsing the content; and the content of such an xml within a column in a table can typically be queried by use of 'xmlselect()' and similar functions.
Think about this for a second; if in the near or distant future the content of the xml that you get from the other application changes, you'll have plenty of changes to do. If it changes often, it'll become a nightmare.
Cheers,
Wim
Try XStream, this one's really simple.
well,i used stax to parse quite a huge of XML nodes, which consumes less memory than Dom and sax, cauz it is of style of pulling XML data. Stax might be a good choice for large XML data nodes.
I've started using VTD (I guess VTD-XML) in Java, and for XPath reads it's excellent. Where i'm hitting an issue now is with inserting data. Lets say I am doing the following:
VTDNav nav = preExistingGen.getNav();
AutoPilot pilot = new AutoPilot(nav);
pilot.selectXPath("/Something/SomethingElse");
if (pilot.evalXPath() != -1) {
XMLModifier modifier = new XMLModifier(nav);
modifier.insertAfterElement("<some>content</some>");
}
What I had assumed was this was a real-time update, which would be reflected in the VTDNav. It looks like my understanding is incorrect, since simply inserting the element content does nothing to the nav (if I output the VTDNav, it still contains my original xml). The only way I can seem to get a handle on the new xml, is by outputting it from the XMLModifier.
modifier.outputAndReparse(); // Gives me a new VTDNav with the new content
Is there something i'm missing here? Is there an easier way of doing this? I wanted to be able to insert the new content, and then immediately get the new index. My existing code (using the standard DOM classes) has a ton of inserts and updates, and I also need to know where the last inserted element existed in the document. Having to outputAndReparse() everytime and then find the inserted element (which I may not even be able to guarantee) doesn't seem like a plausible solution.
I think the answer is to plan your modification and subsequent access to new content carefully. If you insert the new content, and try to access the new content immediately afterwards, insertAndParse() is the way to go. But as you can see, it is rather slow because of the reparsing. My suggestion is that you plan as much as insert all at once, then call reparsing just once, it will be a lot more efficient this way.
The spirit is that VTD-XML is not trying to be DOM, it has its own strengths and weaknesses... and this is one of the weakness, but you can work around it ...
And when you try to merge multiple xml files, vtd-xml will certainly shine....
Also if you tag this question with vtd-xml, i will be able to find it much easier.
modifier.remove():
in the text file stored no.of xpaths, val so for each xpath generate one new output xxx.xml file. when u use xml modifier previous data along with current data it will write in a xxx.xml file so for elimination of previous data and only for current xpath data changes write into the new xxx.xml file so for that use the modifier.remove();
xm2.output( new FileOutputStream("/home/cupola-hadoop-project/TotalEnvironment/document/link/"+j+"101new.xml"));
xm2.remove();
and rotate the loop desired times.
I've been looking at loading XML files with Java and I just can't seem to decipher a certain part of it.
I understand that SAX is a streaming mechanism, but when talking about DOM, various sites talk about the model "loading in the full file" or "loading in the all tags", supported by the recommendation to use SAX with large XML files.
To what degree does DOM actually load the full file? The second I access the root node, does it allocate program memory for every single byte of the file? Does it only load tags until the lowest level when it loads text contents?
I'm going to be working with large files, but random access would be useful and editing is a requirement, so I believe DOM is the best choice for me.
Thanks a lot.
It does load the entire file and constructs a tree structure in memory. Thus, every single tag, attribute and any nested tags (no matter how many levels of nesting) will be loaded. It is just that the constructed tree grows bigger the larger the XML file you have.
Yes, DOM reads the whole document, parses it, and places it in memory.
If you're parsing using DOM, you do something similar to this:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(file);
(inside a try/catch)
The moment the parse is executed, the Document doc variable will contain the entire document represented as a DOM hierarchy.
I've got an XML document that is in either a pre or post FO transformed state that I need to extract some information from. In the pre-case, I need to pull out two tags that represent the pageWidth and pageHeight and in the post case I need to extract the page-height and page-width parameters from a specific tag (I forget which one it is off the top of my head).
What I'm looking for is an efficient/easily maintainable way to grab these two elements. I'd like to only read the document a single time fetching the two things I need.
I initially started writing something that would use BufferedReader + FileReader, but then I'm doing string searching and it gets messy when the tags span multiple lines. I then looked at the DOMParser, which seems like it would be ideal, but I don't want to have to read the entire file into memory if I could help it as the files could potentially be large and the tags I'm looking for will nearly always be close to the top of the file. I then looked into SAXParser, but that seems like a big pile of complicated overkill for what I'm trying to accomplish.
Anybody have any advice? Or simple implementations that would accomplish my goal? Thanks.
Edit: I forgot to mention that due to various limitations I have, whatever I use has to be "builtin" to core Java, in which I can't use and/or download any 3rd party XML tools.
While XPath is very good for querying XML data, I am not aware of good and fast XPath implementation for Java (they all use DOM model at least).
I would recommend you to stick with StAX. It is extremely fast even for huge files, and it's cursor API is rather trivial:
XMLInputFactory f = XMLInputFactory.newInstance();
XMLStreamReader r = f.createXMLStreamReader("my.xml");
try {
while (r.hasNext()) {
r.next();
. . .
}
} finally {
r.close()
}
Consult StAX tutorial and XMLStreamReader javadocs for more information.
You can use XPath to search for your tags. Here is a tutorial on forming XPath expressions. And here is an article on using XPath with Java.
An easy to use parser (dom, sax) is dom4j. It would be quite easier to use than the built-in SAXParser.
try "XMLDog"
This uses sax to evaluate xpaths