Does the complexity of XML structure have an influence on parsing speed? - java

From "parsing speed" point of view, how much influence(if any) has number of attributes and depth of XML document on parsing speed?
Is it better to use more elements or as many attributes as possible?
Is "deep" XML structure hard to read?
I am aware that if I would use more attributes, XML would be not so heavy and that adapting XML to parser is not right way to create XML file
thanks

I think it depends on whether you are doing validation or not. If you are validating against a large and complex schema, then proportionately more time is likely to be spent doing the validation than for a simple schema.
For non-validating parsers, the complexity of the schema probably doesn't matter much. The performance will be dominated by the size of the XML.
And of course performance also depends on the kind of parser you are using. A DOM parser will generally be slower because you have to build a complete in-memory representation before you start. With a SAX parser, you can just cherry-pick the parts you need.
Note however that my answer is based on intuition. I'm not aware of anyone having tried to measure the effects of XML complexity on performance in a scientific fashion. For a start, it is difficult to actually characterize XML complexity. And people are generally more interested in comparing parsers for a given sample XML than in teasing out whether input complexity is a factor.
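If you want to see the effect on your own data, a crude check is to parse the same file once without a schema and once with one attached, and compare the times. This is only a sketch (the file names are placeholders, and a single run is a very rough measure):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ValidationCostCheck {
    public static void main(String[] args) throws Exception {
        File input = new File("sample.xml");   // placeholder file names
        File xsd = new File("sample.xsd");

        // Plain, non-validating parse.
        SAXParserFactory plain = SAXParserFactory.newInstance();
        plain.setNamespaceAware(true);
        long t0 = System.nanoTime();
        plain.newSAXParser().parse(input, new DefaultHandler());
        System.out.println("no validation:   " + (System.nanoTime() - t0) / 1_000_000 + " ms");

        // The same parse with XSD validation attached.
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                                     .newSchema(xsd);
        SAXParserFactory validating = SAXParserFactory.newInstance();
        validating.setNamespaceAware(true);
        validating.setSchema(schema);
        long t1 = System.nanoTime();
        validating.newSAXParser().parse(input, new DefaultHandler());
        System.out.println("with validation: " + (System.nanoTime() - t1) / 1_000_000 + " ms");
    }
}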

Performance is a property of an implementation. Different parsers are different. Don't try to get theoretical answers about performance, just measure it.
Is it better to use more elements or as many attributes as possible?
What has that got to do with performance of parsing? I find it very hard to believe that any difference in performance will justify distorting your XML design. On the contrary, using a distorted XML design in the belief that it will improve parsing speed will almost certainly end up giving you large extra costs in the applications that generate and consume the XML.

If you are using a SAX parser it does not matter how large the XML is, because SAX is a top-down parser and does not hold the full XML in memory. For DOM it does matter, because DOM holds the full XML in memory. You can get some idea of how XML parsers compare in my blog post here.

Related

XML parsing and writing txt files using multiple threads in Java

I have many XML files. Every XML file contains a great many lines and tags. I must parse them and write a .txt file named after each XML file. This needs to be done quickly; the faster the better.
Example of an XML file:
<text>
<paragraph>
<line>
<character>g</character>
<character>o</character>
.....
</line>
<line>
<character>k</character>
.....
</line>
</paragraph>
</text>
<text>
<paragraph>
<line>
<character>c</character>
.....
</line>
</paragraph>
</text>
Example of the text file:
go..
k..
c..
How can I parse many XML files and write many text files using multiple threads in Java, as fast as possible?
Where should I start? Does the method that I use to parse affect the speed? If it does, which method is faster than the others?
I have no experience with multithreading. How should I build a multi-threaded structure to be effective?
Any help is appreciated. Thanks in advance.
EDIT
I need some help. I used SAX for parsing. I did some research on thread pools, multithreading, and Java 8 features. I tried some code blocks, but there was no change in the total time. How can I add a multi-threaded structure or Java 8 features (lambda expressions, parallelism, etc.) to my code?
Points to note in this situation.
In many cases, attempting to write to multiple files at once using multi-threading is utterly pointless. All this generally does is exercise the disk heads more than necessary.
Writing to disk while parsing is also likely a bottleneck. You would be better off parsing the XML into a buffer and then writing the whole buffer to disk in one hit (see the sketch after these points).
The speed of your parser is unlikely to affect the overall time for the process significantly. Your system will almost certainly spend much more time reading and writing than parsing.
A quick check with some real test data would be invaluable. Try to get a good estimate of the amount of time you will not be able to affect.
Determine an approximate total read time by reading a few thousand sample files into memory because that time will still need to be taken however parallel you make the process.
Estimate an approximate total write time in a similar way.
Add the two together and compare that with your total execution time for reading, parsing and writing those same files. This should give you a good idea how much time you might save through parallelism.
Parallelism is not always an answer to slow-running processes. You can often significantly improve throughput just by using appropriate hardware.
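As a rough sketch of that baseline measurement (the file names are placeholders, and Java 11+ is assumed for Files.writeString): read the raw bytes without any parsing to estimate the unavoidable read time, and build each result in a buffer so it can be written in a single call.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class IoBaselineEstimate {
    public static void main(String[] args) throws IOException {
        // Placeholder sample files; use a few thousand real ones for a meaningful estimate.
        List<Path> samples = List.of(Paths.get("sample1.xml"), Paths.get("sample2.xml"));

        // Approximate total read time: load the raw bytes only, no parsing.
        long t0 = System.nanoTime();
        long bytesRead = 0;
        for (Path p : samples) {
            bytesRead += Files.readAllBytes(p).length;
        }
        long readNanos = System.nanoTime() - t0;

        // Approximate total write time: build each result in a buffer, write it in one hit.
        long t1 = System.nanoTime();
        for (Path p : samples) {
            StringBuilder out = new StringBuilder();
            out.append("placeholder result for ").append(p.getFileName()).append('\n');
            Path target = Paths.get(p.getFileName().toString().replace(".xml", ".txt"));
            Files.writeString(target, out, StandardCharsets.UTF_8);   // single write per file
        }
        long writeNanos = System.nanoTime() - t1;

        System.out.printf("read: %d ms (%d bytes), write: %d ms%n",
                readNanos / 1_000_000, bytesRead, writeNanos / 1_000_000);
    }
}

Compare those two numbers with the total time of a full read-parse-write run to see how much is left for parallelism to save.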
First, are you sure you need this to be faster or multithreaded? Premature optimization is the root of all evil. You can easily make your program much more complicated for unimportant gain if you aren't careful, and multithreading can for sure make things much more complicated.
However, toward the actual question:
Start out by solving this in a single-threaded way. Then think about how you want to split the problem across many threads (e.g. have a pool of XML files and threads, and each thread grabs an XML file whenever it's free, until the pool is empty). Report back with wherever you get stuck in this process.
The method that you use to parse will affect speed, as different parsing libraries have different behavior characteristics. But again, are you sure you need the absolute fastest?
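A minimal sketch of that pool-of-files idea, assuming your existing single-threaded parse-and-write step is wrapped in a method (processOneFile and the directory name below are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelXmlJobs {
    public static void main(String[] args) throws Exception {
        // Collect the XML files to process (directory name is a placeholder).
        List<Path> files;
        try (Stream<Path> s = Files.list(Paths.get("xml-input"))) {
            files = s.filter(p -> p.toString().endsWith(".xml")).collect(Collectors.toList());
        }

        // One worker per core; each task parses one file and writes its .txt result.
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<?>> results = files.stream()
                    .map(f -> pool.submit(() -> processOneFile(f)))   // processOneFile is your single-threaded code
                    .collect(Collectors.toList());
            for (Future<?> r : results) {
                r.get();   // wait for completion and propagate any worker exception
            }
        } finally {
            pool.shutdown();
        }
    }

    private static void processOneFile(Path xmlFile) {
        // Placeholder: parse xmlFile (e.g. with SAX) and write the corresponding .txt file.
    }
}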
If you write your code in XSLT (2.0 or later), using the collection() function to parse your source files, and the xsl:result-document instruction to write your result files, then you will be able to assess the effect of multi-threading simply by running the code under Saxon-EE, which applies multi-threading to these constructs automatically. Usually in my experience this gives a speed-up of around a factor of 3 for such programs.
This is one the benefits of using functional declarative languages: because there is no mutable state, multi-threading is painless.
LATER
I'll add an answer to your supplementary question about using DOM or SAX. From what we can see, the output file is a concatenation of the <character> elements in the input, so if you wrote it in XSLT 3.0 it would be something like this:
<xsl:mode on-no-match="shallow-skip">
<xsl:template match="characters">
<xsl:value-of select="."/>
</xsl:template>
If that's the case then there's certainly no need to build a tree representation of each input document, and coding it in SAX would be reasonably easy. Or if you follow my suggestion of using Saxon-EE, you could make the transformation streamable to avoid the tree building. Whether this is useful, however, really depends on how big the source documents are. You haven't given us any numbers to work with, so giving concrete advice on performance is almost impossible.
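For illustration, a SAX handler for this input can stay very small. The sketch below collects the text of each <character> element and starts a new output line at the end of each <line>; it assumes each input file is well-formed XML. You would feed it to SAXParserFactory.newInstance().newSAXParser().parse(file, handler) and then write handler.result() to the .txt file.

import java.io.StringWriter;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects the text of every <character> element, one output line per <line> element.
class CharacterCollector extends DefaultHandler {
    private final StringWriter out = new StringWriter();
    private boolean inCharacter = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("character".equals(qName)) {
            inCharacter = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCharacter) {
            out.write(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("character".equals(qName)) {
            inCharacter = false;
        } else if ("line".equals(qName)) {
            out.write(System.lineSeparator());
        }
    }

    String result() {
        return out.toString();
    }
}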
If you are going to use a tree-based representation, then DOM is the worst one you could choose. It's one of those cases where there are half-a-dozen better alternatives but because they are only 20% better, most of the world still uses DOM, perceiving it to be more "standard". I would choose XOM or JDOM2.
If you're prepared to spend an unlimited amount of time coding this in order to get the last ounce of execution speed, then SAX is the way to go. For most projects, however, programmers are expensive and computers are cheap, so this is the wrong trade-off.

Fastest XML reader for large XML files with Java [closed]

I have an XML file with 100,000 fragments, with 6 fields in every fragment. I want to search that XML for different strings at different times.
What is the best XML reader for Java?
OK, let's say you've got a million elements of size 50 characters each, say 50Mb of raw XML. In DOM that may well occupy 500Mb of memory, with a more compact representation such as Saxon's TinyTree it might be 250Mb. That's not impossibly big by today's standards.
If you're doing many searches of the same document, then the key factor is search speed rather than parsing speed. You don't want to be doing SAX parsing as some people have suggested because that would mean parsing the document every time you do a search.
The next question, I think, is what kind of search you are doing. You suggest you are basically looking for strings in the content, but it's not clear to what extent these are sensitive to the structure. Let's suppose you are searching using XPath or XQuery. I would suggest three possible implementations:
a) use an in-memory XQuery processor such as Saxon. Parse the document into Saxon's internal tree representation, making sure you allocate enough memory. Then search it as often as you like using XQuery expressions (a minimal sketch follows this list). If you use the Home Edition of Saxon, the search will typically be a sequential search with no indexing support.
b) use an XML database such as MarkLogic or eXist. Initial processing of the document to load the database will take a bit longer, but it won't tie up so much memory, and you can make queries faster by defining indexes.
c) consider use of Lux (http://luxdb.org) which is something of a hybrid: it uses the Saxon XQuery processor on top of Lucene, which is a free text database. It seems specifically designed for the kind of scenario you are describing. I haven't used it myself.
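A minimal sketch of option (a) using Saxon's s9api (the file name, element names, and search expression are placeholders): the document is parsed once into the tree, and the compiled query can then be run against it as many times as you like.

import java.io.File;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.XQueryEvaluator;
import net.sf.saxon.s9api.XQueryExecutable;
import net.sf.saxon.s9api.XdmItem;
import net.sf.saxon.s9api.XdmNode;

public class InMemoryXQuerySearch {
    public static void main(String[] args) throws SaxonApiException {
        Processor proc = new Processor(false);   // false = Home Edition

        // Parse the document once into Saxon's internal tree representation.
        XdmNode doc = proc.newDocumentBuilder().build(new File("fragments.xml"));

        // Compile the query once; element and field names are made up for illustration.
        XQueryExecutable query = proc.newXQueryCompiler()
                .compile("//fragment[contains(field1, 'needle')]");

        // Run this (or other compiled queries) repeatedly without re-parsing the document.
        XQueryEvaluator eval = query.load();
        eval.setContextItem(doc);
        for (XdmItem hit : eval.evaluate()) {
            System.out.println(hit);
        }
    }
}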
Are you loading the XML document into memory once and then searching it many times? In that case, it's not so much the speed of parsing that should be the concern, but rather the speed of searching. But if you are parsing the document once for every search, then it's fast parsing you need. The other factors are the nature of your searches, and the way in which you want to present the results.
You ask what is the "best" XML reader in the body of your question, but in the title you ask for the "fastest". It's not always true that the best choice is the fastest: because parsing is a mature technology, different parsing approaches might only differ by a few microseconds in performance. Would you be prepared to spend four times as much development effort in return for 5% faster performance?
The solution for handling very big XML files is to use a SAX parser. With DOM parsing, any library will really struggle with a very big XML file. Well, failing is relative to the amount of memory you have and how efficient the DOM parser is.
But anyway, handling large XML files requires a SAX parser. Think of SAX as something that just throws events at you as it reads through the XML file. It is an event-based sequential parser: event-based because you are handed events such as "start element" and "end element". You have to know which elements you are interested in and handle them accordingly.
I would advise you to play with this simple example to understand SAX,
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/

Best way to input files to XPath

I'm using XPath to read XML files. The size of a file is unknown (between 700KB and 2MB), and I have to read around 100 files per second. So I want a fast way to load and read from XPath.
I tried to use Java NIO file channels and memory-mapped files, but they were hard to use with XPath.
So can someone suggest a way to do it?
A lot depends on what the XPath expressions are doing. There are four costs here: basic I/O to read the files, XML parsing, tree building, and XPath evaluation. (Plus a possible fifth, generating the output, but you haven't mentioned what the output might be.) From your description we have no way of knowing which factor is dominant. The first step in performance improvement is always measurement, and my first step would be to try and measure the contribution of these four factors.
If you're on an environment with multiple processors (and who isn't?) then parallel execution would make sense. You may get this "for free" if you can organize the processing using the collection() function in Saxon-EE.
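As a starting point for that measurement, a sketch along these lines keeps the costs reasonably separable (the XPath expression and file names are placeholders). The expression is compiled once and reused for every file, so only I/O, parsing, tree building and evaluation remain in the per-file loop:

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathOverManyFiles {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();

        // Compile the expression once and reuse it for every file.
        XPathExpression expr = XPathFactory.newInstance().newXPath().compile("//record/id");

        for (String name : new String[] {"file1.xml", "file2.xml"}) {
            Document doc = builder.parse(new File(name));                           // I/O + parsing + tree building
            NodeList hits = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);  // XPath evaluation
            System.out.println(name + ": " + hits.getLength() + " matches");
        }
    }
}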
If I were you, I would probably drop Java in this case altogether, not because you can't do it in Java, but because using a bash script (in case you are on Unix) is going to be faster, at least that is what my experience dealing with lots of files tells me.
On *nix you have a utility called xpath exactly for that.
Since you are doing lots of I/O operations, having a decent SSD disk would help far more than doing it in separate threads. You still need to do it with multiple threads, but not more than one per CPU.
If you want performance I would simply drop XPath altogether and use a SAX parser to read the files. You can search Stackoverflow for SAX vs XPath vs DOM kind of questions to get more details. Here is one Is XPath much more efficient as compared to DOM and SAX?

Memory-efficient XML manipulation in Java

We are in the process of implementing a transactional system that has two backend components:
Component A generates an initial XML response
Component B modifies the initial response XML
The resulting XML is sent back to the requestor. Since we are likely doing this under heavy load, I'd like to do this in a very CPU/memory efficient way.
What is the best way to perform the above while keeping a tight leash on overall memory utilization?
Specifically, is my best bet to do a DOM parse of the output of Component A and pass that to Component B to modify in memory? Is there a better way to do this using SAX, which may be more memory-efficient? Are there standard libraries that do this via SAX or DOM?
Thanks for any insights.
-Raj
Generally, SAX is more memory-efficient than DOM, because the entire document does not need to be loaded into memory for processing. The answer, however, depends on the specifics of your "Component B modifies the initial response XML" requirements.
If each change is local to its own XML sub-tree (i.e. you may need data from all nodes leading to the root of the tree, but not siblings), SAX will work better.
If the changes require referencing siblings to produce the results, DOM will work better, because it would let you avoid constructing your own data structure for storing the siblings.
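If the streaming route fits, one concrete pattern is a SAX filter sitting between Component A's output and the serializer. Below is a sketch (the element and attribute names are made up for illustration); the modification happens while events flow from parser to serializer, so the full document is never held in memory:

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Streams Component A's output through a filter that rewrites one attribute on the fly.
public class ResponseRewriter extends XMLFilterImpl {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("status".equals(localName)) {                     // hypothetical element name
            AttributesImpl copy = new AttributesImpl(atts);
            int i = copy.getIndex("", "code");                // hypothetical attribute name
            if (i >= 0) {
                copy.setValue(i, "MODIFIED");
            }
            super.startElement(uri, localName, qName, copy);
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        ResponseRewriter filter = new ResponseRewriter();
        filter.setParent(reader);

        // Identity transform: events flow parser -> filter -> serializer, so no DOM is built.
        TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(filter, new InputSource("componentA-output.xml")),   // placeholder input
                new StreamResult(System.out));
    }
}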
An aspect or filter on Component B that applies an XSLT transformation to the initial XML response might be a clean way to accomplish it. The memory utilization depends on the size of the request and the number of instances in memory. CPU usage will depend on these two factors as well.
DOM requires that the whole XML document be resident in memory before you modify it. If it's just a couple of elements that have to change, then SAX is a good alternative.
SAX is an event-based parsing utility. You are notified of events such as startDocument(), startElement(), endElement(), etc., and you save in memory only the things you wish to save. You handle only the events you care about, which can really increase the speed of parsing and decrease the use of memory. How memory-efficient it is depends on what, and how much, you keep in memory. For the general case, SAX is more memory-efficient than DOM, which keeps the entire document in memory in order to process it.

Alternative to XSLT?

On my project I have a huge XSLT stylesheet used to convert some XML files to HTML.
The problem is that this file is growing day by day; it's hard to read, debug and test.
So I was thinking about moving the whole transformation process to Java.
Do you think that is a good idea? If so, what libraries for parsing XML and generating HTML (XML) do you suggest? Will performance be better or worse?
If it's not a good idea, is there any alternative?
Thanks
Randomize
Take a look at CDuce - it is a strictly typed, statically compiled XML processing language.
I once had a client with a similar problem - thousands of lines of XSLT, growing all the time. I spent an hour reading it with increasing incredulity, then rewrote it in 20 lines of XSLT.
Refactoring is often a good idea, and the worse the code is, the more worthwhile refactoring is. But there's no reason to believe that just because the code is bad and in need of refactoring, you need to change to a different programming language. XSLT is actually very good at handling variety and complexity if you know how to use it properly.
It's possible that the code is an accumulation of special handling of special cases, and each new special case discovered results in more rules being added. That's a tough problem to tackle in any language, but XSLT can deal with it better than most, provided you apply your mind all the time to finding abstract general rules that encompass all the special rules, so you only need to code the special rules as exceptions.
I'd consider Velocity as an alternative. I prefer it to XSLT. The transforms are harder to write than templates, because the latter look exactly like the XML I wish to produce. It's a simple thing to add in the markup to map in the data.
