I have an XML file with 100,000 fragments, with 6 fields in every fragment. I want to search that XML for different strings at different times.
What is the best XML reader for Java?
OK, let's say you've got a million elements of 50 characters each, say 50MB of raw XML. In DOM that may well occupy 500MB of memory; with a more compact representation such as Saxon's TinyTree it might be 250MB. That's not impossibly big by today's standards.
If you're doing many searches of the same document, then the key factor is search speed rather than parsing speed. You don't want to be doing SAX parsing as some people have suggested because that would mean parsing the document every time you do a search.
The next question, I think, is what kind of search you are doing. You suggest you are basically looking for strings in the content, but it's not clear to what extent these are sensitive to the structure. Let's suppose you are searching using XPath or XQuery. I would suggest three possible implementations:
a) use an in-memory XQuery processor such as Saxon. Parse the document into Saxon's internal tree representation, making sure you allocate enough memory. Then search it as often as you like using XQuery expressions. If you use the Home Edition of Saxon, the search will typically be a sequential search with no indexing support.
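To illustrate option (a), here is a minimal sketch using Saxon's s9api; the file name, element names, and query string are illustrative, not taken from your data. The point is that you parse the document once and then evaluate queries against the in-memory tree as often as you like.

```java
import java.io.File;
import net.sf.saxon.s9api.*;

public class SaxonSearch {
    public static void main(String[] args) throws SaxonApiException {
        Processor proc = new Processor(false);                  // false = no commercial features (Home Edition)
        DocumentBuilder builder = proc.newDocumentBuilder();
        XdmNode doc = builder.build(new File("fragments.xml")); // parse once, keep the tree in memory

        XQueryCompiler compiler = proc.newXQueryCompiler();
        XQueryExecutable query = compiler.compile(
                "//fragment[contains(field1, 'needle')]");      // hypothetical element names and search string

        // Re-run searches against the already-built tree as often as needed
        XQueryEvaluator eval = query.load();
        eval.setContextItem(doc);
        for (XdmItem item : eval.evaluate()) {
            System.out.println(item);
        }
    }
}
```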
b) use an XML database such as MarkLogic or eXist. Initial processing of the document to load the database will take a bit longer, but it won't tie up so much memory, and you can make queries faster by defining indexes.
c) consider use of Lux (http://luxdb.org) which is something of a hybrid: it uses the Saxon XQuery processor on top of Lucene, which is a free text database. It seems specifically designed for the kind of scenario you are describing. I haven't used it myself.
Are you loading the XML document into memory once and then searching it many times? In that case, it's not so much the speed of parsing that should be the concern, but rather the speed of searching. But if you are parsing the document once for every search, then it's fast parsing you need. The other factors are the nature of your searches, and the way in which you want to present the results.
You ask for the "best" XML reader in the body of your question, but in the title you ask for the "fastest". The best choice is not always the fastest: because parsing is a mature technology, different parsing approaches might only differ by a few microseconds in performance. Would you be prepared to put in four times as much development effort in return for 5% faster performance?
The solution for handling very big XML files is to use a SAX parser. With DOM parsing, any library will eventually fail on a very big XML file; well, failing is relative to how much memory you have and how efficient the DOM parser is.
But in any case, handling large XML files requires a SAX parser. Think of SAX as something that just throws events at you as it reads through the XML file. It is an event-based sequential parser: event-based because you are handed events such as start-element and end-element. You have to know which elements you are interested in and handle them accordingly.
I would advise you to play with this simple example to understand SAX,
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
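To make the idea concrete, here is a bare-bones sketch of a SAX handler that buffers each element's text and checks it against a search string; the file name, element handling, and search string are all illustrative.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class FragmentSearch {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new java.io.File("fragments.xml"), new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                text.setLength(0);                         // reset the buffer at the start of each element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);            // may be called more than once per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (text.indexOf("needle") >= 0) {         // hypothetical search string
                    System.out.println(qName + ": " + text);
                }
            }
        });
    }
}
```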
I'm part of a team creating a data store that passes information around in large XML documents (herein called messages). On the back end, the messages get shredded apart and stored in Accumulo in pieces. When a caller requests data, the pieces get reassembled into a message tailored for the caller. The schemas are somewhat complicated, so we couldn't use JAXB out of the box. The team (this was a few years ago) assumed that DOM wasn't performant. We're now buried in layer after layer of half-broken parsing code that will take months to finish, will break the second someone changes the schema, and is making me want to jam a soldering iron into my eyeball. As far as I can tell, if we switch to using the DOM method a lot of this fart code can be cut and the code base will be more resilient to future changes. My team lead is telling me that there's a performance hit in using the DOM, but I can't find any data that validates that assumption that isn't from 2006 or earlier.
Is parsing large XML documents via DOM still sufficiently slow to warrant all the pain that XMLBeans is causing us?
edit 1 In response to some of your comments:
1) This is a government project so I can't get rid of the XML part (as much as I really want to).
2) The issue with JAXB, as I understand it, had to do with the substitution groups present in our schemas. Also, maybe I should restate the issue with JAXB as being one of the ratio of effort to return in using it.
3) What I'm looking for is some kind of recent data supporting/disproving the contention that using XMLBeans is worth the pain we're going through writing a bazillion lines of brittle binding code because it gives us an edge in terms of performance. Something like Joox looks so much easier to deal with, and I'm pretty sure we can still validate the result after the server has reassembled a shredded message before sending it back to the caller.
So does anyone out there in SO land know of any data germane to this issue that's no more than five years old?
Data binding solutions like XMLBeans can perform very well, but in my experience they can become quite unmanageable if the schema is complex or changes frequently.
If you're considering DOM, then don't use DOM, but one of the other tree-based XML models such as JDOM2 or XOM. They are much better designed.
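For example, a minimal JDOM2 sketch (the file name and element handling are illustrative) reads like ordinary Java collections code rather than DOM's NodeList juggling:

```java
import java.io.File;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;

public class Jdom2Example {
    public static void main(String[] args) throws Exception {
        // Parse the whole document into JDOM2's tree model
        Document doc = new SAXBuilder().build(new File("message.xml"));
        Element root = doc.getRootElement();

        // Walk the tree with plain Java collections
        for (Element child : root.getChildren()) {
            System.out.println(child.getName() + " = " + child.getTextTrim());
        }
    }
}
```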
Better still (but it's probably too radical a step given where you are starting) don't process your XML data in Java at all, but use an XRX architecture where you use XML-based technologies end-to-end: XProc, XForms, XQuery, XSLT.
I think from your description that you need to focus on cleaning up your application architecture rather than on performance. Once you've cleaned it up, performance investigation and tuning will be vastly easier.
If you want the best technology for heavy duty XML processing, you might want to investigate this paper. The best technology will no doubt be clear after you read it...
The paper details:
Bruno Oliveira, Vasco Santos, and Orlando Belo, "Processing XML with Java – A Performance Benchmark". CIICESI, School of Management and Technology, Polytechnic of Porto, Felgueiras, Portugal; Algoritmi R&D Centre, University of Minho, 4710-057 Braga, Portugal.
What I think I'm looking for is a NoSQL, library-embedded, on-disk (i.e. not in-memory) database that's accessible from Java (and preferably runs inside my instance of the JVM). That's not really much of a database, and I'm tempted to roll my own. Basically I'm looking for the "should we keep this in memory or put it on disk" portion of a database.
Our model has grown to several gigabytes. Right now this is all done in memory, meaning we're pushing the JVM upward of several gigabytes. It's currently all stored in a flat XML file, serialized and deserialized with XStream and compressed with Java's built-in gzip libraries. That worked well while our model stayed under 100MB, but now that it's larger it's becoming a problem.
Loosely speaking, that model can be broken down as:
Project
configuration component (directed-acyclic-graph), not at all database friendly
a list of a dozen "experiment" structures
each containing a list of about a dozen "run-model" structures.
each run-model contains hundreds of megabytes of data; once written, they are never edited.
What I'd like to do is have something that conforms to a map interface, of guid -> run-model. This mini-database would keep a flat table of these objects. On our experiment model, we would replace the list of run-models with a list of guids, and add, at the application layer, a get call to this map, which would pull it off the disk and into memory.
That means we can keep configuration of our program in XML (which I'm very happy with) and keep a table of the big data in a DBMS that will keep us from consuming multi-GB of memory. On program start and exit I could then load and unload the two portions of our model (the config section in XML, and the run-models in the database format) from an archiving format.
I'm sort of feeling gung-ho about this, and think I could probably implement it with some of XStream's XML inspection strategies and a custom map implementation, but a voice in the back of my head is telling me I should find a library to do it instead.
Should I roll my own or is there a database that's small enough to fit this bill?
Thanks guys,
-Geoff
http://www.mapdb.org/
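If you go that route, here is a minimal sketch using MapDB's 3.x API; the file name, map name, and value type are illustrative (you could, for instance, store XStream-serialized bytes as the values under each guid):

```java
import java.util.concurrent.ConcurrentMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class RunModelStore {
    public static void main(String[] args) {
        // Open (or create) a file-backed store; the data lives on disk, not in the heap
        DB db = DBMaker.fileDB("run-models.db")
                .fileMmapEnableIfSupported()        // use memory-mapped I/O where the OS allows it
                .make();

        // A persistent Map<String, byte[]>: guid -> serialized run-model
        ConcurrentMap<String, byte[]> runModels = db
                .hashMap("runModels", Serializer.STRING, Serializer.BYTE_ARRAY)
                .createOrOpen();

        runModels.put("some-guid", new byte[]{1, 2, 3});   // e.g. XStream-serialized bytes
        byte[] loaded = runModels.get("some-guid");        // pulled off disk on demand

        db.close();
    }
}
```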
Also take a look at this question: Alternative to BerkeleyDB?
Since MapDB is a possible solution for your problem, Chronicle Map is also worth considering. It's an embeddable Java key-value store, optionally persistent, offering a very similar programming model to MapDB: access is also through the vanilla java.util.Map interface, with transparent serialization of keys and values.
The major difference is that, according to third-party benchmarks, Chronicle Map is several times faster than MapDB.
Regarding stability, no bugs have been reported against Chronicle Map's data storage for months now, while it is in active use in many projects.
Disclaimer: I'm the developer of Chronicle Map.
From "parsing speed" point of view, how much influence(if any) has number of attributes and depth of XML document on parsing speed?
Is it better to use more elements or as many attributes as possible?
Is "deep" XML structure hard to read?
I am aware that if I would use more attributes, XML would be not so heavy and that adapting XML to parser is not right way to create XML file
thanks
I think it depends on whether you are doing validation or not. If you are validating against a large and complex schema, then proportionately more time is likely to be spent doing the validation than for a simple schema.
For non-validating parsers, the complexity of the schema probably doesn't matter much. The performance will be dominated by the size of the XML.
And of course performance also depends on the kind of parser you are using. A DOM parser will generally be slower because you have to build a complete in-memory representation before you start. With a SAX parser, you can just cherry-pick the parts you need.
Note however that my answer is based on intuition. I'm not aware of anyone having tried to measure the effects of XML complexity on performance in a scientific fashion. For a start, it is difficult to actually characterize XML complexity. And people are generally more interested in comparing parsers for a given sample XML than in teasing out whether input complexity is a factor.
Performance is a property of an implementation. Different parsers are different. Don't try to get theoretical answers about performance, just measure it.
Is it better to use more elements or as many attributes as possible?
What has that got to do with performance of parsing? I find it very hard to believe that any difference in performance will justify distorting your XML design. On the contrary, using a distorted XML design in the belief that it will improve parsing speed will almost certainly end up giving you large extra costs in the applications that generate and consume the XML.
If you are using a SAX parser it does not matter whether the XML is large or not, because SAX is a top-down parser and does not hold the full XML in memory. For DOM it does matter, since DOM holds the full document in memory. You can get some idea of how the XML parsers compare in my blog post here.
I'm processing hundreds of thousands of files. Potentially millions later on down the road. A bad file will contain a text version of an excel spreadsheet or other text that isn't binary but also isn't sentences. Such files cause CoreNLP to blow up (technically, these files take a long time to process such as 15 seconds per kilobyte of text.) I'd love to detect these files and discard them in sub-second time.
What I am considering is taking a few thousand files at random, examining the first, say, 200 characters, and looking at the distribution of characters to determine what is legit and what is an outlier: for example, whether there are no punctuation marks or too many of them. Does this seem like a good approach? Is there a better one that has been proven? I think this will work well enough, possibly throwing out potentially good files, but rarely.
Another idea is to simply run with the tokenize and ssplit annotators and do a word and sentence count. That seems to do a good job as well and returns quickly. I can think of cases where this might fail too.
This kind of processing pipeline is always in a state of continuous improvement. To kick off that process, the first thing I would build is instrumentation around the timing behavior of CoreNLP: if CoreNLP is taking too long on a file, kick the offending file out into a separate queue. If this isn't good enough, you can write recognizers for the most common things in the takes-too-long queue and divert them before they hit CoreNLP. The main advantage of this approach is that it works with inputs that you don't expect in advance.
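A possible shape for that instrumentation, sketched with plain JDK concurrency; the runCoreNlp method and the time budget are placeholders for your real pipeline call and threshold:

```java
import java.util.concurrent.*;

public class TimeBoxedPipeline {
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Returns true if the file was processed within the budget; otherwise it is
    // diverted to a separate queue instead of stalling the whole pipeline.
    static boolean processWithBudget(String path, long budgetMillis,
                                     BlockingQueue<String> tooSlowQueue) throws InterruptedException {
        Future<?> job = pool.submit(() -> runCoreNlp(path));   // runCoreNlp stands in for your real call
        try {
            job.get(budgetMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            job.cancel(true);              // interrupt the worker; CoreNLP may not stop instantly
            tooSlowQueue.put(path);        // kick the offending file into the takes-too-long queue
            return false;
        } catch (ExecutionException e) {
            tooSlowQueue.put(path);        // treat hard failures the same way
            return false;
        }
    }

    private static void runCoreNlp(String path) {
        // placeholder for the actual CoreNLP annotation of the file's text
    }
}
```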
There are two main approaches to this kind of problem.
The first is to take the approach you are considering in which you examine the contents of the file and decide whether it is acceptable text or not based on a statistical analysis of the data in the file.
The second approach is to use some kind of meta tag, such as a file extension, to at least eliminate those files that are almost certainly going to be a problem (.pdf, .jpg, etc.).
I would suggest a mixture of the two approaches so as to cut down on the amount of processing.
You might consider a pipeline approach in which you have a sequence of tests. The first test filters out files based on meta data such as the file extension, the second step then does a preliminary statistical check on the first few bytes of the file to filter out obvious problem files, a third step does a more involved statistical analysis of the text, and the fourth handles the CoreNLP rejection step.
You do not say where the files originate nor if there are any language considerations (English versus French versus Simplified Chinese text). For instance are the acceptable text files using UTF-8, UTF-16, or some other encoding for the text?
Also is it possible for the CoreNLP application to be more graceful about detecting and rejecting incompatible text files?
Could you not just train a Naive Bayes Classifier to recognize the bad files? For features use things like (binned) percentage of punctuation, percentage of numerical characters, and average sentence length.
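For what it's worth, here is a rough sketch of extracting such features from a text sample; the character classes, binning, and sentence-splitting heuristic are entirely up to you:

```java
public class TextStats {
    // assumes "text" is, say, the first few hundred characters of the file
    static double[] features(String text) {
        int punct = 0, digits = 0, letters = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isDigit(c)) digits++;
            else if (Character.isLetter(c)) letters++;
            else if (".,;:!?\"'()-".indexOf(c) >= 0) punct++;
        }
        int sentences = Math.max(1, text.split("[.!?]+").length);
        double n = Math.max(1, text.length());
        return new double[] {
            punct / n,                 // punctuation ratio
            digits / n,                // numeric-character ratio
            letters / n,               // letter ratio
            n / sentences              // rough average sentence length, in characters
        };
    }
}
```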
Peter,
You are clearly dealing with files for ediscovery. Anything and everything is possible, and as you know, anything kicked out must be logged as an exception. I've faced this, and have heard the same from other analytics processors.
Some of the solutions above, both pre-processing and in-line, can help. In some ediscovery solutions it may be feasible to dump the text into a SQL field and truncate it, and still get what you need. In other apps, anything to do with semantic clustering or predictive coding, it may be better to use pre-filters based on metadata (e.g. file type), document-type classification libraries, and entity extraction based on prior examples, current sampling, or your best guess as to the nature of the "bad file" contents.
Good luck.
I'm using XPath to read XML files. The size of a file is unknown (between 700KB and 2MB) and I have to read around 100 files per second. So I want a fast way to load the files and read them with XPath.
I tried to use Java NIO file channels and memory-mapped files, but they were hard to use with XPath.
So can someone suggest a way to do it?
A lot depends on what the XPath expressions are doing. There are four costs here: basic I/O to read the files, XML parsing, tree building, and XPath evaluation. (Plus a possible fifth, generating the output, but you haven't mentioned what the output might be.) From your description we have no way of knowing which factor is dominant. The first step in performance improvement is always measurement, and my first step would be to try and measure the contribution of these four factors.
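If it helps, here is a rough way to split out three of those costs (raw I/O, parsing plus tree building, and XPath evaluation) using only the JDK; the file name and expression are placeholders:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathTiming {
    public static void main(String[] args) throws Exception {
        File file = new File("sample.xml");                // illustrative file and expression
        String expr = "//record[status='active']";

        long t0 = System.nanoTime();
        byte[] raw = java.nio.file.Files.readAllBytes(file.toPath());       // basic I/O
        long t1 = System.nanoTime();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new java.io.ByteArrayInputStream(raw));              // parsing + tree building
        long t2 = System.nanoTime();
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET); // XPath evaluation
        long t3 = System.nanoTime();

        System.out.printf("io=%dms parse=%dms xpath=%dms hits=%d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000,
                hits.getLength());
    }
}
```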
If you're on an environment with multiple processors (and who isn't?) then parallel execution would make sense. You may get this "for free" if you can organize the processing using the collection() function in Saxon-EE.
If I were you, I would probably drop Java in this case altogether, not because you can't do it in Java, but because using a bash script (in case you are on Unix) is going to be faster; at least that is what my experience dealing with lots of files tells me.
On *nix you have the utility called xpath exactly for that.
Since you are doing lots of I/O operations, having a decent SSD would help far more than doing it in separate threads. You still need to do it with multiple threads, but not more than one per CPU, as sketched below.
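A minimal sketch of that threading setup, sized to the CPU count; processOneFile stands in for whatever parse-and-XPath work you do per file:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelXmlJobs {
    public static void main(String[] args) {
        List<String> files = List.of("a.xml", "b.xml", "c.xml");   // illustrative file list
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cpus); // one worker per CPU

        for (String f : files) {
            pool.submit(() -> processOneFile(f));   // stand-in for your per-file parse + XPath work
        }
        pool.shutdown();
    }

    private static void processOneFile(String path) {
        // parse the file and evaluate the XPath expression here
    }
}
```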
If you want performance I would simply drop XPath altogether and use a SAX parser to read the files. You can search Stack Overflow for SAX vs XPath vs DOM questions to get more details. Here is one: Is XPath much more efficient as compared to DOM and SAX?