How to write an XML database file efficiently? - java

I want to build an XML file as a datastore. It should look something like this:
<datastore>
<item>
<subitem></subitem>
...
<subitem></subitem>
</item>
....
<item>
<subitem></subitem>
...
<subitem></subitem>
</item>
</datastore>
At runtime I may need to add items to the datastore. The number of items may be high, so I don't want to hold the whole document in memory and can't use DOM. I just want to write the part where a change occurs. Or does DOM support this?
I had a first look at StAX, but I am not sure if it does what I want.
Wouldn't it be best to remember a cursor position at the end of the file, right before the root element is closed? That is always the position where new items will be added. So if I remember that position and keep it up to date during changes, I could add a new item at the end without iterating through the whole file.
Maybe a second cursor could be used independently of the first one, to iterate over the document just for reading purposes.
I can't see that StAX supports any of this, does it?
Isn't there a block-based API for files instead of a stream-based one? Aren't files and filesystems typical examples of block "devices"? And if there is such an API, does it help me with my problem?
Thanks in advance.

Updating XML is basically impossible because there's no "cheap" way to insert data.
Appending XML is not so bad. All you need to do there is seek to the end of the file, then GO BACK over the "end tag" (</datastore> in this case), and then just start writing. This is a cheap operation all told, but none of the frameworks really support this as they're all mostly designed to work with well formed, full boat XML documents, as a whole, not in pieces.
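The seek-and-overwrite idea above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: the class and method names are made up, the file is UTF-8, and the closing </datastore> tag sits within the last 256 bytes of the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper: appends an item by overwriting the closing root tag.
public class XmlAppender {
    private static final String END_TAG = "</datastore>";

    public static void appendItem(Path file, String itemXml) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long length = raf.length();
            int tailSize = (int) Math.min(length, 256);
            byte[] tail = new byte[tailSize];
            raf.seek(length - tailSize);
            raf.readFully(tail);
            // ISO-8859-1 maps one byte to one char, so string indexes equal byte offsets.
            int idx = new String(tail, StandardCharsets.ISO_8859_1).lastIndexOf(END_TAG);
            if (idx < 0) {
                throw new IOException("closing tag not found in the last " + tailSize + " bytes");
            }
            // Go back over the end tag, write the new item, then write the end tag again.
            raf.seek(length - tailSize + idx);
            raf.write((itemXml + "\n" + END_TAG + "\n").getBytes(StandardCharsets.UTF_8));
            raf.setLength(raf.getFilePointer());
        }
    }
}
```

The caller is responsible for passing well-formed, already escaped item markup; nothing here validates it.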
You could use a StAX like thing, but in this case, StAX isn't aware of the <datastore> tag, rather it's just aware of the <item> tags and its elements. Then you create Items and start writing, over and over and over, to the same OutputStream that you have set up.
That's the best way to do this.
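A minimal sketch of writing bare <item> fragments to one long-lived OutputStream with the StAX cursor API might look like this. The class name and the subitem layout are assumptions for illustration; the writer never emits the <datastore> wrapper, so that part has to be handled elsewhere.

```java
import java.io.OutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: writes item fragments only, without a root element.
public class ItemFragmentWriter implements AutoCloseable {
    private final XMLStreamWriter writer;

    public ItemFragmentWriter(OutputStream out) throws XMLStreamException {
        writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
    }

    // Writes one <item> with a <subitem> per value; text is escaped by the writer.
    public void writeItem(String... subitems) throws XMLStreamException {
        writer.writeStartElement("item");
        for (String value : subitems) {
            writer.writeStartElement("subitem");
            writer.writeCharacters(value);
            writer.writeEndElement();
        }
        writer.writeEndElement();
        writer.writeCharacters("\n");
        writer.flush();
    }

    @Override
    public void close() throws XMLStreamException {
        writer.close();
    }
}
```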
But if you need to delete or change data, then you get to rewrite stuff, or do hacks, such as marking elements as "inactive", hunting them down in the XML file, seeking to the 'active="Y"' attribute, and then changing the Y to N in place. It can be done, and it will be mostly efficient, but it's far and away outside what the normal XML processing frameworks let you do. If I were to do that, I'd read the entire file once, keep track of those entries and note their locations within it, so that later I could seek to them and change them efficiently.
Then when you update something, you "inactivate" the old one and "append" the new one. Eventually you get to GC the file by rewriting it all and throwing out the old, "inactive" entries.
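As a rough sketch of the "inactivate in place" hack: one pass records the byte offset of each active flag, and a later call flips a single byte. The names are hypothetical, and the sketch assumes the flag is always written exactly as active="Y" and that the file is small enough to index in one read (a real index pass over a large file would scan it as a stream).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for flipping active="Y" to active="N" without rewriting the file.
public class ActiveFlags {
    private static final String MARKER = "active=\"Y\"";

    // Returns the byte offset of the Y in every active="Y" attribute.
    public static List<Long> indexActiveFlags(Path file) throws IOException {
        // ISO-8859-1 keeps one char per byte, so string indexes are byte offsets.
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        List<Long> offsets = new ArrayList<>();
        int from = 0;
        while ((from = text.indexOf(MARKER, from)) >= 0) {
            offsets.add((long) from + MARKER.indexOf('Y'));
            from += MARKER.length();
        }
        return offsets;
    }

    // Overwrites the single byte at the given offset, after checking it really is a Y.
    public static void deactivate(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(offset);
            if (raf.read() != 'Y') {
                throw new IOException("no active flag at offset " + offset);
            }
            raf.seek(offset);
            raf.write('N');
        }
    }
}
```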

As a rule of thumb, XML files aren't very efficient as datastores, not for the record-based data you seem to want to use them for.
But if you've already got the file and absolutely can't do anything about it, you can use StAX XMLEventReaders and XMLEventWriters to read through a file quickly and insert/modify elements in it.
But when I say quickly, what I mean is more quickly than DOM would be, but nowhere near as effective as any relational DB.
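For illustration, a streaming copy that inserts one new <item> just before the root element closes could look roughly like this. The class name and the single-text-node item are assumptions; the whole file is still read and rewritten, only without building a tree in memory.

```java
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies a document event by event and adds an item at the end.
public class StaxInsertCopy {
    public static void copyAndAppendItem(InputStream in, OutputStream out, String text)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
        XMLEventFactory events = XMLEventFactory.newInstance();
        int depth = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement() && --depth == 0) {
                // The root element is about to close: emit the new item first.
                writer.add(events.createStartElement("", "", "item"));
                writer.add(events.createCharacters(text));
                writer.add(events.createEndElement("", "", "item"));
            }
            writer.add(event);
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```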
Update: Another option you can consider is vtd-xml, although I haven't tried it in real projects, it actually looks pretty decent.

If you always want to append items at the end, then the best way to handle this is to have two XML files. The outer one, datastore.xml, is simply a wrapper, and looks like this:
<!DOCTYPE datastore [
<!ENTITY e SYSTEM "items.xml">
]>
<datastore>&e;</datastore>
The file items.xml looks like this:
<item>....</item>
<item>....</item>
<item>....</item>
with no wrapper element.
When you want to append data, you can open items.xml and write to the end of it. When you want to read data, open datastore.xml with an XML parser.
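In Java, the two-file setup could be used roughly as sketched below. The class and method names are made up for illustration, and the sketch assumes the parser is allowed to resolve external entities, which is the JDK default but is often switched off when parsing untrusted input.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper around the wrapper file (datastore.xml) and the fragment file (items.xml).
public class EntityDatastore {
    // Appends one item to the fragment file; no wrapper element is ever touched.
    public static void appendItem(Path itemsFile, String itemXml) throws IOException {
        Files.write(itemsFile, (itemXml + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Parses the wrapper file; the parser pulls in items.xml through the entity reference.
    public static int countItems(Path wrapperFile)
            throws IOException, SAXException, ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(wrapperFile.toFile());
        return doc.getElementsByTagName("item").getLength();
    }
}
```

Reading through DOM here is only for brevity; any parser that expands external entities should see the same combined document.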
Of course, once your data grows beyond 20Mb or so, it may well be better to use an XML database. But I've been using this approach for years for records of Saxon orders, with files that are currently about 8Mb, and it works fine.

It's not very easy or efficient to partially update an XML file so you won't find much support for it as a use case.
Really it sounds like you need a proper database, perhaps with a tool to export the data as XML.
If you don't want to use a DB and insist on storing the data purely as XML you might consider keeping all your items in memory as objects. Whenever a new one is added you can write all of them out to XML. It might seem inefficient, but depending on your data size might still be good enough.
If you choose this path, you might want to check out the XStream library to make this quite easy; see the XStream tutorial for a quick example.


What's the right way to produce a XML content in Java?

I've read several questions and tutorials over the internet such as
Best XML parser for Java [closed]
JAVA XML - How do I get specific elements in an XML Node?
What is the best way to create XML files in Java?
how to modify xml tag specific value in java?
Using StAX - From Oracle Tutorials
Apache Xerces Docs
Introduction to XML and XML With Java
Java DOM Parser - Modify XML Document
But since this is the very first time I have to manipulate XML documents in Java, I'm still a little bit confused. The XML content is written with String concatenation and that seems wrong to me. It is the same as concatenating Strings to produce a JSON object instead of using a JSONObject class. This is the way the code is written right now:
"<msg:restenv xmlns:msg=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req req.xsd\"><autenticacao><usuario>"
+ usuario + "</usuario><senha>" + StringUtils.defaultIfBlank(UmbrellaRestClient.PARAMETROS_INFRA_UMBRELLA.get("SENHA_UMBRELLA"), "WS.INTEGRADOR")
+ "</senha></autenticacao><parametros><parametro><p_vl_gnre>" + valorGNRE + "</p_vl_gnre><p_cnpj_destinatario>" + cnpjFilial + "</p_cnpj_destinatario><p_num_ped_compra>" + idPedido
+ "</p_num_ped_compra><p_xml_sefaz><![CDATA[" + arquivoNfe + "]]></p_xml_sefaz></parametro></parametros></msg:restenv>"
In my research I think that almost everything I've read pointed to SAX as the best solution but I never really found something really useful to read about it, almost everything states that we have to create a handler and override startElement, endElement and characters methods.
I don't have to serialize the XML file in hard disk or database or anything else, I just need to pass its content to a webservice.
So my question really is, which is the right way to do it?
Concatenate Strings the way things are done right now?
Write the XML file using a Java API like Xerces? I have no clue on how that can be done.
Read the XML file with streams and just change node texts? My XML without the values would look like this:
<msg:restenv xmlns:msg="{url}"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="{schemaLocation}">
<autenticacao>
<usuario></usuario>
<senha></senha>
</autenticacao>
<parametros>
<parametro>
<p_vl_gnre></p_vl_gnre>
<p_cnpj_destinatario></p_cnpj_destinatario>
<p_num_ped_compra></p_num_ped_compra>
<p_xml_sefaz><![CDATA[]]></p_xml_sefaz>
</parametro>
</parametros>
</msg:restenv>
I've also read something about using Apache Velocity as a template engine, since I don't actually have to serialize the XML, and that's an approach that I really like because I've already worked with this framework and it's really simple.
Again, I'm not looking for the best way, but the right one with tutorials and examples, if possible, on how to get things done.
It all depends on context. There is no single "right way".
The biggest issue with concatenation is the combination of escaping the XML into a String constant (which is fiddly) and also escaping the values that you are using so that they're correct for an XML document.
For small XMLs, that's fine.
But for larger ones, it can be a pain.
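One way to avoid hand-escaping values, sketched here with the JDK's StAX writer, is to let the writer emit the tags and escape the text. This is only an illustration: the class name is made up, only part of the request from the question is built, and the CDATA section is assumed not to contain the sequence "]]>".

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical builder for a cut-down version of the request in the question.
public class RequestXmlBuilder {
    private static final String MSG_NS = "http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req";

    public static String build(String usuario, String senha, String arquivoNfe)
            throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("msg", "restenv", MSG_NS);
        w.writeNamespace("msg", MSG_NS);
        w.writeStartElement("autenticacao");
        writeSimple(w, "usuario", usuario);
        writeSimple(w, "senha", senha);
        w.writeEndElement(); // autenticacao
        w.writeStartElement("parametros");
        w.writeStartElement("parametro");
        w.writeStartElement("p_xml_sefaz");
        w.writeCData(arquivoNfe); // written verbatim inside <![CDATA[ ... ]]>
        w.writeEndElement(); // p_xml_sefaz
        w.writeEndElement(); // parametro
        w.writeEndElement(); // parametros
        w.writeEndElement(); // msg:restenv
        w.flush();
        w.close();
        return out.toString();
    }

    // Writes <name>text</name>; the writer escapes <, > and & in the text.
    private static void writeSimple(XMLStreamWriter w, String name, String text)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text);
        w.writeEndElement();
    }
}
```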
If most of your XML is boilerplate with just a few values inserted, you may find that templates using something like Velocity or any of the other several libraries may be quite effective. It helps keep the template "native" (you don't have to wrap it in "'s and escape it), plus it keeps the XML out of your code, but easily lets you stamp in the parts that you need to do.
I agree that there's not just one way to do it, but I would advise you to take a look at JAXB. You can easily consume and produce XML without any of that pesky String manipulation. Look here for a simple intro: https://docs.oracle.com/javase/tutorial/jaxb/index.html
The Answer by Will Hartung is correct. There is not one right way as it depends on your situation.
For a beginner programmer, I suggest writing the strings manually so you get to understand XML in general and your content in particular. As for the mechanics of String concatenation, you would generally be using StringBuilder rather than String for better performance. Where thread-safety is needed, use StringBuffer.
The major issue is memory.
Abundant Memory: If you have lots of memory and small XML documents, you can load the entire document into memory. This way you can traverse a document forwards, backwards, and jump around arbitrarily. This approach is known as the Document Object Model (DOM). One better-known implementation of this approach is Apache Xerces. There are other good implementations as well.
Scarce Memory: If you have little memory and large XML documents, then you need to plow through the document from start to finish, biting off small chunks at a time for lower memory usage. This approach is known as SAX. You can find multiple good implementations.
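To make the SAX style concrete, here is a small sketch of a handler that overrides startElement, characters and endElement to collect the text of one element name while the parser streams through the document. The class name and the choice to match on the qualified name are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every element with a given name.
public class SaxTextCollector extends DefaultHandler {
    private final String target;
    private final List<String> values = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a target element

    public SaxTextCollector(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals(target)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several pieces, so append rather than assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(target) && current != null) {
            values.add(current.toString());
            current = null;
        }
    }

    public static List<String> collect(String xml, String elementName) throws Exception {
        SaxTextCollector handler = new SaxTextCollector(elementName);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.values;
    }
}
```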
Another issue is validation. Do you want to validate the XML documents against a DTD or Schema? Some tools do this and some do not.
When all you need is to serialize the content of a Java object and read it back, I recommend the Simple XML Serialization library. Much simpler with a quicker learning-curve than the other tools.

What is the best way to modify a few fields in an XML using Java

I have a big XML document which contains around 300 elements. I need to modify 2 or 3 elements in this XML using Java. I don't want to go for conventional marshalling and unmarshalling as it involves parsing the whole XML. How about XPath/XSLT manipulation? I know that I can easily read the data, but I need to modify it and put it back in the same XML. The primary concern here is performance. Kindly advise.
Using XPath/XSLT means that you load the whole document into memory before you start to transform it. If that is a problem (e.g. document too big for memory), then you need to use another solution. That said, 300 elements doesn't sound very "big".
One alternative would be to use a StAX parser to find and change the target elements. Take a look at Is there a way to build a StAX filter chain?
It sounds like XSLT might be too heavyweight for this problem. You want to rewrite a file slightly. If you can describe each change easily (for example, you want to remove the "foo" attribute on the "bar" element), consider applying a regular expression substitution. Something like this:
String fileContents = ...
fileContents = fileContents.replaceAll("<bar foo=\"\\w+\"", "<bar"); // Strings are immutable, so keep the returned value

Parsing very large XML files and marshalling to Java Objects

I have the following issue: I have very large XML files (like 300+ MB), and I need to parse them in order to add some of their values to the db. The structure of these files is also very complex. I want to use a StAX parser as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, and thus not loading the whole thing in memory. On the other hand, getting the values with StAX (at least on these XML files) is cumbersome; I need to write a ton of code. From this latter point of view it would help me immensely if I could marshall the XML file to Java objects (like JAXB does); however, this would load the whole file plus a ton of object instances in memory all at once.
My question is, is there some way to pull-parse (or just partially parse) the file sequentially, and then marshall only those parts to Java objects so I can deal with them easily without bogging down on memory?
I would recommend Eclipse EMF. But it has the same problem, if you give it the file name it would parse the whole thing. Although there are some options to reduce how much is loaded, but I didn't bother much as we run on machines with 96 GB RAM. :)
Anyway, If your XML format is well defined, then one workaround is to fool the EMF by breaking down the whole file into several smaller (but still well defined) XML snippets. Then feed each snippet one after the other. I don't know JAX-B, but perhaps the same workaround can be applied there as well. Which I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
............
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
</tag1>
Then it can be broken down into one XML each starting with <tag2> and ending with </tag2>. And in java most parsers would accept a Stream, so just parse using whatever you want, create some StringStream or something for each <tag2> in a loop and pass to JAX-B or EMF.
HTH
Well, first off I want to thank the two people who answered my question, but I ended up not using those suggestions, partly because the proposed technologies are a bit far from what you might call "standard XML parsing" in Java, and it feels weird going so far when there's a similar tool already present in Java, and partly because I did in fact find a solution that only uses Java APIs to accomplish this.
I will not detail too much the solution I found, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is that if you have an XML document AND its corresponding XSD schema, you can parse and unmarshal it with JAXB, and you can do it in chunks; those chunks can be read with an event parser such as StAX and then passed to the JAXB Unmarshaller.
This practically means that you must first decide where in your XML file you can say "this part here has A LOT of repetitive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated a lot inside a parent tag. So all you have to do is make an event listener in your StAX parser that is triggered at the start of each of those child tags, then stream the content of that child tag over to JAXB, unmarshal it with JAXB and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
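The chunking idea can be sketched with JDK classes only. JAXB is no longer bundled with recent JDKs, so this sketch swaps it for an identity transform that turns each repeated child element into a small DOM tree; with JAXB on the classpath, Unmarshaller.unmarshal(XMLStreamReader, Class) would take the place of the transform call. The class name and the choice to return each chunk's text content are assumptions for illustration.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;

// Hypothetical helper: materializes one repeated child element at a time.
public class StaxChunker {
    public static List<String> chunkTexts(Reader xml, String childName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        List<String> texts = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(childName)) {
                // Only this element's subtree is built in memory.
                DOMResult chunk = new DOMResult();
                identity.transform(new StAXSource(reader), chunk);
                texts.add(chunk.getNode().getFirstChild().getTextContent());
                // The transform has already moved the reader past the chunk,
                // so do not advance here; re-check the current event instead.
            } else {
                reader.next();
            }
        }
        reader.close();
        return texts;
    }
}
```

Checking the current event after each transform, rather than calling nextTag(), is deliberate: it avoids skipping a chunk when two repeated elements follow each other with no whitespace in between.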
Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml

Android application architecture - browsing an xml file

This is more of an overall architecture question about Android, and I'm curious what the community thinks is best practice for this type of endeavor. I am developing an Android application which loads an XML file that is stored on the device. My first question is, when you are dealing with a formatted XML file in the scope of an Android application, and the main point of the application is to sort of "browse" through the nodes of the XML, is it smarter to "load the XML" (not really sure what the term is) into memory and do it that way? Or is it smarter to take the XML, write it to an internal database (still getting used to the whole SQLite concept), and then browse through the data that way? The latter seems like a roundabout way, but I'm trying to understand core concepts here.
This brings me to my second question. If I were to draw out how the data from this XML "flows", the immediate answer in my head as far as what I know about Android is, a bunch of ListViews. Node 1 has 2 choices. This loads two choices into a ListView. When you click on the first node, it goes to the corresponding subnode in the xml, which has, say, four choices. I create a ListView with 4 choices. So on and so forth.
Does this make logical sense? Am I looking at the approach wrong? Is there a better way to do it using a different object that makes more sense? Any references to things that have already been done for me to compare to would be helpful as well. Thanks!!
Don't convert the XML into a sqlite database. Just parse it in memory.
As far as your other questions, I'd have one activity that extends from ListActivity. Override onListItemClick() and make it start your activity again with some kind of pointer to the next element to browse.
Doing it this way will make the activity stack behave well as the user presses the back button.
A lot depends on your specific use case and the size of the XML file. For the most part, I think you will have a heck of a hard time placing your XML in the database unless you already have a data model that is represented by the XML and is suitable for persistence. You surely don't want to do it with random XML.
If you have small XML you can always load it in memory using DOM. That will make it easy to navigate. But with large XML, you need to consider some streaming API (Stax) and read directly from file.
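For the small-file case, browsing with DOM could be sketched like this: load the document once, then ask a node for its child elements each time the user drills down (one list screen per call). The class and method names are made up for illustration; the javax.xml.parsers and org.w3c.dom packages used here are part of both the JDK and the Android platform.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for drilling down through an in-memory DOM tree.
public class XmlBrowser {
    // Parses the XML text and returns the root element.
    public static Element load(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Returns the direct child elements, skipping text and comment nodes.
    public static List<Element> childElements(Element parent) {
        List<Element> children = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                children.add((Element) n);
            }
        }
        return children;
    }
}
```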
SQLite is a relational database, so you need to store data from the XML in a DB only if you need to perform relational operations on the data (e.g. selection, update, grouping and so on). If you just need to go through the DOM and do something (e.g. count specific nodes), I believe you should not load the XML into a DB.

Small modification to an XML document using StAX

I'm currently trying to read in an XML file, make some minor changes (alter the value of some attributes), and write it back out again.
I had intended to use a StAX parser (javax.xml.stream.XMLStreamReader) to read in each event, see if it was one I wanted to change, and then pass it straight on to the StAX writer (javax.xml.stream.XMLStreamWriter) if no changes were required.
Unfortunately, that doesn't look to be so simple - The writer has no way to take an event type and a parser object, only methods like writeAttribute and writeStartElement. Obviously I could write a big switch statement with a case for every possible type of element which can occur in an XML document, and just write it back out again, but it seems like a lot of trouble for something which seems like it should be simple.
Is there something I'm missing that makes it easy to write out a very similar XML document to the one you read in with StAX?
After a bit of mucking around, the answer seems to be to use the Event reader/writer versions rather than the Stream versions.
(i.e. javax.xml.stream.XMLEventReader and javax.xml.stream.XMLEventWriter)
See also http://www.devx.com/tips/Tip/37795, which is what finally got me moving.
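A rough sketch of that event-based copy, changing one attribute value on the way through, is shown below. The class, method and parameter names are hypothetical; every event other than the matching start tag is handed to the writer untouched.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies a document and replaces one attribute's value.
public class AttributeRewriter {
    public static void rewrite(Reader in, Writer out, String element, String attribute,
                               String newValue) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(element)) {
                StartElement start = event.asStartElement();
                // Rebuild the attribute list, swapping in the new value where the name matches.
                List<Attribute> attrs = new ArrayList<>();
                Iterator<?> it = start.getAttributes();
                while (it.hasNext()) {
                    Attribute a = (Attribute) it.next();
                    attrs.add(a.getName().getLocalPart().equals(attribute)
                            ? events.createAttribute(a.getName(), newValue)
                            : a);
                }
                event = events.createStartElement(start.getName(), attrs.iterator(),
                        start.getNamespaces());
            }
            writer.add(event);
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```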
StAX works pretty well and is very fast. I used it in a project to parse XML files which are up to 20MB. I don't have a thorough analysis, but it was definitely faster than SAX.
As for your question: the difference between streaming and event handling, AFAIK, is control. With the streaming API you can walk through your document step by step and get the contents you want, whereas with the event-based API you only handle what you are interested in.
I know this is rather old question, but if anyone else is looking for something like this, there is another alternative: Woodstox Stax2 extension API has method:
XMLStreamWriter2.copyEventFromReader(XMLStreamReader2 r, boolean preserveEventData)
which copies the currently pointed-to event from stream reader using stream writer. This is not only simple but very efficient. I have used it for similar modifications with success.
(how to get XMLStreamWriter2 etc? All Woodstox-provided instances implement these extended versions -- plus there are wrappers in case someone wants to use "basic" Stax variants, as well)
