I need to parse about 70 MB of data with Java, and currently I have an XML document (one level deep, no nested children) where each document has multiple fields.
I was wondering whether I should replace it with a simpler text file in which each row is a document and the fields are comma separated.
Is this going to significantly improve performance? And what if I had, for instance, 4 GB of data instead?
Thanks.
It would probably be more efficient to use the text file than the XML file if you ever get to a point where you can't fit the whole data set into memory at once. At that point, being able to parse the text file line by line would be better than the XML approach (which, I believe, loads the whole file into memory).
According to Robin Green, XML is only parsed into memory all at once if you use DOM; SAX parsing streams the file.
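For illustration, reading a comma-separated text file line by line might look roughly like this (the file name and field layout are assumptions, not something from the question):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class LineByLineReader {
        public static void main(String[] args) throws IOException {
            // Stream the file one line at a time instead of loading it all into memory.
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Hypothetical layout: each row is one document, fields comma separated.
                    String[] fields = line.split(",");
                    // process fields[0], fields[1], ... here
                }
            }
        }
    }

Only one line is held in memory at a time, so this scales to the 4 GB case as well.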
There are other ways to persist data like this:
Database
Can this data be represented in a database? Java has good support for most database systems; you just have to add the right driver library.
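A minimal JDBC sketch, assuming an embedded database such as H2 is on the classpath and that the URL, credentials, table and columns below are adjusted to your own setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class DatabaseExample {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details and schema.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./docs", "sa", "")) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT title, body FROM documents WHERE id = ?")) {
                    ps.setInt(1, 42);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getString("title"));
                        }
                    }
                }
            }
        }
    }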
Java Properties
An alternative is the Java Properties system. This lets you write all your data to a file and load it back later; Java parses the file for you when loading it.
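A small sketch of java.util.Properties (the file name and keys are invented for the example):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class PropertiesExample {
        public static void main(String[] args) throws IOException {
            Properties props = new Properties();

            // Store some values; Java writes the file format for you.
            props.setProperty("doc.count", "1000");
            props.setProperty("doc.source", "import.xml");
            try (FileOutputStream out = new FileOutputStream("data.properties")) {
                props.store(out, "example data");
            }

            // Load them back later; Java parses the file when loading it.
            try (FileInputStream in = new FileInputStream("data.properties")) {
                props.load(in);
            }
            System.out.println(props.getProperty("doc.count"));
        }
    }

Note that Properties is a flat key/value store, so it suits simple per-document fields better than deeply structured data.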
Related
I'm building a Java application that will have a big XML file containing lots of information. I will also have another file, probably binary, that will hold indexes into that XML file. So, for example, if the XML file has 1000 registers and from the binary file I find out that I want to load register number 687, my idea is to load it directly from the XML file without needing to load the whole XML file into memory.
My first idea was to use the Java DOM (Document Object Model) to work with the XML. To parse an XML file into Java objects, I read that the best method to use is DocumentBuilder's parse method, which returns a tree representing the XML file.
So, my question is: when I use this method, will it have to load the whole XML file into memory to parse it, or will it load things lazily depending on which parts of the file my program actually uses?
Sorry if these are dumb questions; any kind of help or advice is welcome! Remember that my objective is to access the disk as few times as possible. Thanks!!
I'm pretty sure the answer I'm going to get is: "why don't you just have the text files all be the same or follow some set format?" Unfortunately I do not have that option, but I was wondering if there is a way to take any text file and translate it into another text or XML file that will always look the same.
The text files pretty much contain the same data, just arranged differently.
The closest I can come up with is to have an XSLT sheet for each text file, but then I have to turn around and read the file that was just created, delete it, and repeat for each text file.
So, is there a way to grab the data from text files that essentially hold the same data, just stored differently, and put that data in an object that I could then reuse later in some process?
If it were up to me, I would push for every text file to follow some predefined format, since they all pretty much contain the same data, but it's not up to me.
Odd question... You say they are text files, yet you mention XSLT as a possible solution. XSLT will only work if the source is XML; if that is the case, please restate the question. If you say text files, I assume delimiter-separated (e.g. CSV), fixed length, ...
There are some parsers out there (like Smooks) that allow you to parse multiple formats, but they will still require you to perform the "mapping" yourself, of course.
This is a typical problem in the integration world, so any integration tool should offer you a solution (e.g. WSO2, Fuse, ...).
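If you end up doing the mapping by hand in plain Java, one rough sketch is a single shared object plus one small parser per input layout (the DataRecord fields and the comma-separated layout below are invented for the example):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    // One shared representation for the data, regardless of how each file arranges it.
    class DataRecord {
        String name;
        String value;
    }

    // Each differently arranged text file gets its own small parser.
    interface RecordParser {
        List<DataRecord> parse(Path file) throws IOException;
    }

    // Example parser for a comma-separated layout; other layouts get their own class.
    class CommaSeparatedParser implements RecordParser {
        @Override
        public List<DataRecord> parse(Path file) throws IOException {
            List<DataRecord> records = new ArrayList<>();
            for (String line : Files.readAllLines(file)) {
                // Assumes each line has at least two comma-separated fields.
                String[] parts = line.split(",");
                DataRecord r = new DataRecord();
                r.name = parts[0].trim();
                r.value = parts[1].trim();
                records.add(r);
            }
            return records;
        }
    }

The rest of your process then works only with List<DataRecord> and never cares which file layout the data came from.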
I want to use XML for storing some data, but I do not want to read the full file when I want to get the last data that was inserted, and I do not want to rewrite the full file when adding new data. Is there a standard way in Java to parse an XML file not from the beginning but from the end, so that, for example, a SAX or StAX parser would first encounter the last closing root tag and then the last element? Or, if I want to do this, should I read and write everything as if I were reading and writing a regular text file?
Fundamentally, XML is a poor representation choice for this. The format is inherently "contained" in that way, and I haven't seen any APIs which encourage you to fight against that.
Options:
Choose a different format entirely (e.g. use a database)
Create lots of small XML files instead - each one self-contained. When you want the whole of the data, read all the files
Just swallow the hit and read/write the whole file each time.
I found a good topic on this with example solutions for what I want.
This link: http://www.oreillynet.com/xml/blog/2007/03/parsing_xml_backwards.html
It seems that XML is not a good file format for what I want. There is no standard parser that can parse XML from the end instead of from the beginning.
Probably the best solution for me will be to store everything in one file that is a composition of many XML fragments, with the contents of a separate XML fragment stored on each line. The file itself is not well-formed XML, but each line contains well-formed XML that I can parse using a standard XML parser (StAX).
This way I will be able to read just the lines at the end of the file and append new data to the end of the file. When I need all of the data, or only part of it, I will read all of the lines or just some of them. I can probably also implement pagination from the end of the file, because the file can get big.
Why XML on each line? Because there is an easy API for parsing it, and storing the data as XML is more human-readable than just separating the values on each line with some symbol.
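A rough sketch of that one-fragment-per-line idea (the file name and element names are placeholders, and for a really big file you would seek backwards from the end rather than read every line):

    import java.io.StringReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Collections;
    import java.util.List;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class XmlPerLineStore {
        public static void main(String[] args) throws Exception {
            Path file = Paths.get("entries.log");

            // Adding new data means appending one line; nothing is rewritten.
            String entry = "<entry><value>42</value></entry>";
            Files.write(file, Collections.singletonList(entry), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);

            // Reading the latest entry: take the last line and parse just that fragment.
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            String last = lines.get(lines.size() - 1);

            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(last));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    System.out.println("value: " + reader.getText());
                }
            }
            reader.close();
        }
    }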
Why not use SAX/StAX and simply process only your last entry? Yes, it will need to open and go through the whole file, but at least that is fairly efficient compared to loading the whole DOM tree.
Short of doing that, I don't think you can do what you're asking using XML as a source.
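As a hedged sketch of that approach with StAX (assuming the file name and that each <entry> element holds simple text content, which is not given in the question): scan the whole document, keep overwriting a variable, and whatever is left at the end is the last entry.

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class LastEntryScanner {
        public static void main(String[] args) throws Exception {
            String lastEntryText = null;

            try (FileInputStream in = new FileInputStream("data.xml")) {
                XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
                while (reader.hasNext()) {
                    // Each time an <entry> starts, remember its text; the value left
                    // after the loop belongs to the last entry in the file.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "entry".equals(reader.getLocalName())) {
                        lastEntryText = reader.getElementText();
                    }
                }
                reader.close();
            }
            System.out.println("last entry: " + lastEntryText);
        }
    }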
Another alternative, apart from the ones provided by Jon Skeet in his answer, would be to keep the same format but insert the latest entries first, and stop processing the file as soon as you've read your entry.
So say you have a file that is written in XML or some other markup language of that sort. Is it possible to rewrite just one line rather than reading the entire file into a string, changing that line, and then having to write the whole string back to the file?
In general, no. File systems let you overwrite bytes in place, but they don't support inserting or removing data in the middle of a file.
If your data file is in a fixed-size record format then you can edit a record without overwriting the rest of the file. For something like XML, you could in theory overwrite one value with a shorter one by inserting semantically-irrelevant whitespace, but you wouldn't be able to write a larger value.
In most cases it's simpler to just rewrite the whole file - either by reading and writing in a streaming fashion if you can (and if the file is too large to read into memory in one go) or just by loading the whole file into some in-memory data structure (e.g. XDocument), making the changes, and then saving the file again. (You may want to consider saving to a different file then moving the files around to avoid losing data if the save operation fails for some reason.)
If all of this ends up being too expensive, you should consider using multiple files (so each one is smaller) or a database.
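In Java, the "save to a different file, then move it into place" idea from above might look roughly like this (the file names and the placeholder line edit are assumptions):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import java.util.List;
    import java.util.stream.Collectors;

    public class SafeRewrite {
        public static void main(String[] args) throws IOException {
            Path original = Paths.get("data.xml");
            Path temp = Paths.get("data.xml.tmp");

            // Load, change what you need, and write the result to a temporary file.
            List<String> lines = Files.readAllLines(original, StandardCharsets.UTF_8);
            List<String> updated = lines.stream()
                    .map(line -> line.replace("old-value", "new-value")) // placeholder edit
                    .collect(Collectors.toList());
            Files.write(temp, updated, StandardCharsets.UTF_8);

            // Replace the original only after the new file is fully written,
            // so a failed save does not destroy the existing data.
            Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
        }
    }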
If the line you want to replace is longer than the new line you want to put in its place, then it is possible, as long as it is acceptable to have some kind of padding (for example whitespace characters ' ') which will not affect your application.
If, on the other hand, the new content is larger than the content being replaced, you will need to shift all the following data downwards, so you have to rewrite the file, or at least everything from the replaced line onwards.
Since you mention XML, it might be that you are approaching your problem in the wrong way. Could it be that what you actually need is to replace a specific XML node? If so, you might consider using DOM to read the XML into a hierarchy of nodes, adding/updating/removing nodes there, and then writing the XML tree back to the file.
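A minimal DOM sketch of that last suggestion (the file name and the <price> element are made up for the example): parse the document, update one node, and write the whole tree back out.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class UpdateXmlNode {
        public static void main(String[] args) throws Exception {
            File file = new File("data.xml");

            // Read the XML into an in-memory tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(file);

            // Change the text of the first <price> element (a placeholder node name).
            NodeList prices = doc.getElementsByTagName("price");
            if (prices.getLength() > 0) {
                prices.item(0).setTextContent("42.00");
            }

            // Write the modified tree back over the original file.
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(file));
        }
    }

Note that this still reads and rewrites the whole document; it just keeps the editing step at the node level instead of at the text level.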
I need to parse an XML file using Java and have to create a bean out of that XML file after parsing.
I need this while using Spring JMS, in which the producer is producing an XML file. First I need to read the XML file and then take action accordingly.
I read something about parsing and came up with these options:
XPath
DOM
Which will be the best option for parsing the XML file?
Did you check JAXB?
There are three ways of parsing an XML file: SAX, DOM and StAX.
DOM will parse the whole file and build up a tree in memory - great for small files, but obviously if the file is huge then you don't want the entire tree just sitting in memory! SAX is event-based - it doesn't load anything into memory per se, it just fires off a series of events as it reads through the file. StAX is a median between the two: the application moves the cursor forward as it needs, grabbing the data as it goes (so there's no event firing or huge memory consumption).
Which one you use will really depend on your application - all three have built-in support in the JDK since Java 6.
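For reference, a cursor-style StAX read might look like this (the file name and <item> element are placeholders, and each <item> is assumed to hold simple text content):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxCursorExample {
        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("input.xml")) {
                XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);

                // The application pulls the cursor forward and grabs data as it goes;
                // only the current event is held in memory.
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "item".equals(reader.getLocalName())) {
                        System.out.println("item: " + reader.getElementText());
                    }
                }
                reader.close();
            }
        }
    }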
It looks like you receive a serialized object via Java messaging. First, have a look at how the object is being serialized. Usually this is done with a library (JAXB, Axis, ...), and you could use the very same library to create a deserializer.
You will need:
The XML schema (an xsd file)
The Java bean class (very helpful; it should already exist)
Then the library will usually generate all the helper classes and files, and you don't have to care about parsing.
If you need to create an object, just extract the needed properties and go on...
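If JAXB is in play, deserializing the XML payload into a bean can be as short as this sketch (the Order bean and the sample XML are stand-ins, and it assumes the javax.xml.bind JAXB API is available, as it was in Java up to version 8):

    import java.io.StringReader;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.bind.annotation.XmlRootElement;

    public class JaxbExample {

        // Stand-in bean; in practice this is generated from the xsd or written to match it.
        @XmlRootElement(name = "order")
        public static class Order {
            public String id;
            public String customer;
        }

        public static void main(String[] args) throws Exception {
            String xml = "<order><id>17</id><customer>acme</customer></order>";

            // JAXB builds the bean for you; no hand-written parsing needed.
            Unmarshaller unmarshaller = JAXBContext.newInstance(Order.class).createUnmarshaller();
            Order order = (Order) unmarshaller.unmarshal(new StringReader(xml));
            System.out.println(order.customer);
        }
    }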
I recommend using StAX; see this tutorial for more information.
Umm... there are several ways you can parse an XML document into memory and work with it. You mentioned DOM. DOM actually loads the whole document into memory and then allows you to move between different branches of the XML document.
Alternatively, you could use StAX. It works similarly to DOM; the main difference is that it streams the content of the XML document, which allows for better use of memory. On the other hand, it does not retain the information that has already been read.
Look at http://download.oracle.com/javaee/5/tutorial/doc/bnbem.html - it gives details about both parsing methods along with example code. Hope that helps.