XML parsing and writing txt files using multithreading in Java

I have many XML files. Every XML file contains a great many lines and tags. I must parse each one and write a .txt file with the XML file's name. This needs to be done quickly; the faster, the better.
Example of an XML file:
<text>
<paragraph>
<line>
<character>g</character>
<character>o</character>
.....
</line>
<line>
<character>k</character>
.....
</line>
</paragraph>
</text>
<text>
<paragraph>
<line>
<character>c</character>
.....
</line>
</paragraph>
</text>
Example of the text file:
go..
k..
c..
How can I parse many XML files and write many text files using multiple threads in Java, as fast as I can?
Where should I start in solving the problem? Does the method I use to parse affect speed? If it does, which method is faster than the others?
I have no experience with multithreading. How should I build a multithreaded structure to be effective?
Any help is appreciated. Thanks in advance.
EDIT
I need some help. I used SAX for parsing. I did some research on thread pools, multithreading and Java 8 features. I tried some code blocks, but there was no change in the total time. How can I add a multithreaded structure or Java 8 features (lambda expressions, parallelism, etc.) to my code?

Points to note in this situation.
In many cases, attempting to write to multiple files at once using multi-threading is utterly pointless. All this generally does is exercise the disk heads more than necessary.
Writing to disk while parsing is also likely to be a bottleneck. You would be better off parsing the XML into a buffer and then writing the whole buffer to disk in one hit.
The speed of your parser is unlikely to affect the overall time for the process significantly. Your system will almost certainly spend much more time reading and writing than parsing.
A quick check with some real test data would be invaluable. Try to get a good estimate of the amount of time you will not be able to affect.
Determine an approximate total read time by reading a few thousand sample files into memory, because that time will still have to be spent however parallel you make the process.
Estimate an approximate total write time in a similar way.
Add the two together and compare that with your total execution time for reading, parsing and writing those same files. This should give you a good idea how much time you might save through parallelism.
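As a rough illustration, here is a minimal sketch of how you might time the raw read step on its own; the "xml-input" directory and the sample size are placeholders, and you would compare the result against a full read-parse-write run over the same files.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReadTimeEstimate {
    public static void main(String[] args) throws Exception {
        // Collect a sample of XML files; "xml-input" is a placeholder directory.
        List<Path> samples;
        try (Stream<Path> stream = Files.list(Paths.get("xml-input"))) {
            samples = stream.filter(p -> p.toString().endsWith(".xml"))
                            .limit(1000)
                            .collect(Collectors.toList());
        }

        // Time just the raw reads, with no parsing and no writing.
        long start = System.nanoTime();
        long totalBytes = 0;
        for (Path p : samples) {
            totalBytes += Files.readAllBytes(p).length;
        }
        long readMillis = (System.nanoTime() - start) / 1_000_000;

        System.out.printf("Read %d bytes from %d files in %d ms%n",
                totalBytes, samples.size(), readMillis);
        // Compare this figure with the total time of a full read+parse+write run
        // over the same files to see how much room parallelism actually has.
    }
}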
Parallelism is not always an answer to slow-running processes. You can often significantly improve throughput just by using appropriate hardware.

First, are you sure you need this to be faster or multithreaded? Premature optimization is the root of all evil. You can easily make your program much more complicated for unimportant gain if you aren't careful, and multithreading can for sure make things much more complicated.
However, toward the actual question:
Start out by solving this in a single-threaded way. Then think about how you want to split the problem across many threads (e.g. have a pool of XML files and a pool of threads, and each thread grabs an XML file whenever it is free, until the pool is empty). Report back wherever you get stuck in this process.
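A minimal sketch of that "pool of files, pool of threads" idea using an ExecutorService is below; the "xml-input" and "txt-output" directories are placeholders, and parseToText stands in for whatever single-threaded parsing code you already have.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class ParallelConverter {

    public static void main(String[] args) throws Exception {
        Files.createDirectories(Paths.get("txt-output"));

        // One worker per core is a reasonable starting point; measure before tuning.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        try (Stream<Path> files = Files.list(Paths.get("xml-input"))) {
            files.filter(p -> p.toString().endsWith(".xml"))
                 .forEach(xml -> pool.submit(() -> convert(xml)));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void convert(Path xml) {
        try {
            // parseToText stands in for your existing single-threaded parsing code.
            String text = parseToText(xml);
            Path out = Paths.get("txt-output",
                    xml.getFileName().toString().replace(".xml", ".txt"));
            // Buffer the whole result and write it to disk in one hit.
            Files.write(out, text.getBytes());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static String parseToText(Path xml) throws Exception {
        // ... your SAX handler goes here ...
        return "";
    }
}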
The method that you use to parse will affect speed, as different parsing libraries have different behavior characteristics. But again, are you sure you need the absolute fastest?

If you write your code in XSLT (2.0 or later), using the collection() function to parse your source files, and the xsl:result-document instruction to write your result files, then you will be able to assess the effect of multi-threading simply by running the code under Saxon-EE, which applies multi-threading to these constructs automatically. Usually in my experience this gives a speed-up of around a factor of 3 for such programs.
This is one of the benefits of using functional declarative languages: because there is no mutable state, multi-threading is painless.
LATER
I'll add an answer to your supplementary question about using DOM or SAX. From what we can see, the output file is a concatenation of the <character> elements in the input, so if you wrote it in XSLT 3.0 it would be something like this:
<xsl:mode on-no-match="shallow-skip"/>
<xsl:template match="character">
<xsl:value-of select="."/>
</xsl:template>
If that's the case then there's certainly no need to build a tree representation of each input document, and coding it in SAX would be reasonably easy. Or if you follow my suggestion of using Saxon-EE, you could make the transformation streamable to avoid the tree building. Whether this is useful, however, really depends on how big the source documents are. You haven't given us any numbers to work with, so giving concrete advice on performance is almost impossible.
If you are going to use a tree-based representation, then DOM is the worst one you could choose. It's one of those cases where there are half-a-dozen better alternatives but because they are only 20% better, most of the world still uses DOM, perceiving it to be more "standard". I would choose XOM or JDOM2.
If you're prepared to spend an unlimited amount of time coding this in order to get the last ounce of execution speed, then SAX is the way to go. For most projects, however, programmers are expensive and computers are cheap, so this is the wrong trade-off.
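A minimal sketch of that SAX route, using the element names from the example in the question (treat it as an illustration, not a tuned implementation):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CharacterConcatenator extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();
    private boolean inCharacter;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("character".equals(qName)) {
            inCharacter = true;                    // start collecting text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCharacter) {
            out.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("character".equals(qName)) {
            inCharacter = false;
        } else if ("line".equals(qName)) {
            out.append(System.lineSeparator());    // one output line per <line>
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CharacterConcatenator handler = new CharacterConcatenator();
        parser.parse(new File(args[0]), handler);
        // Buffer the whole result and write it in one go.
        Files.write(Paths.get(args[0].replace(".xml", ".txt")),
                handler.out.toString().getBytes());
    }
}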

Related

Java XML parsing DOM performance

I'm part of a team creating a data store that passes information around in large XML documents (herein called messages). On the back end, the messages get shredded apart and stored in Accumulo in pieces. When a caller requests data, the pieces get reassembled into a message tailored for the caller. The schemas are somewhat complicated, so we couldn't use JAXB out of the box. The team (this was a few years ago) assumed that DOM wasn't performant. We're now buried in layer after layer of half-broken parsing code that will take months to finish, will break the second someone changes the schema, and is making me want to jam a soldering iron into my eyeball. As far as I can tell, if we switch to using the DOM method, a lot of this code can be cut and the code base will be more resilient to future changes. My team lead is telling me that there's a performance hit in using the DOM, but I can't find any data that validates that assumption that isn't from 2006 or earlier.
Is parsing large XML documents via DOM still sufficiently slow to warrant all the pain that XMLBeans is causing us?
Edit 1: In response to some of your comments:
1) This is a government project so I can't get rid of the XML part (as much as I really want to).
2) The issue with JAXB, as I understand it, had to do with the substitution groups present in our schemas. Also, maybe I should restate the issue with JAXB as one of effort versus return in using it.
3) What I'm looking for is some kind of recent data supporting/disproving the contention that using XMLBeans is worth the pain we're going through writing a bazillion lines of brittle binding code because it gives us an edge in terms of performance. Something like Joox looks so much easier to deal with, and I'm pretty sure we can still validate the result after the server has reassembled a shredded message before sending it back to the caller.
So does anyone out there in SO land know of any data germane to this issue that's no more than five years old?
Data binding solutions like XMLBeans can perform very well, but in my experience they can become quite unmanageable if the schema is complex or changes frequently.
If you're considering DOM, then don't use DOM, but one of the other tree-based XML models such as JDOM2 or XOM. They are much better designed.
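For a flavour of the difference, here is a minimal JDOM2 sketch of loading a document and walking its children; the file name and the "record" element name are only illustrative, not taken from your schema.

import java.io.File;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;

public class Jdom2Example {
    public static void main(String[] args) throws Exception {
        // Build an in-memory tree; the JDOM2 API is considerably friendlier than DOM's.
        Document doc = new SAXBuilder().build(new File("message.xml"));
        Element root = doc.getRootElement();

        // Walk the children; "record" is an illustrative element name.
        for (Element child : root.getChildren("record")) {
            System.out.println(child.getName() + " = " + child.getTextTrim());
        }
    }
}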
Better still (but it's probably too radical a step given where you are starting) don't process your XML data in Java at all, but use an XRX architecture where you use XML-based technologies end-to-end: XProc, XForms, XQuery, XSLT.
I think from your description that you need to focus on cleaning up your application architecture rather than on performance. Once you've cleaned it up, performance investigation and tuning will be vastly easier.
If you want the best technology for heavy duty XML processing, you might want to investigate this paper. The best technology will no doubt be clear after you read it...
The paper details :
Processing XML with Java – A Performance Benchmark
Bruno Oliveira, Vasco Santos and Orlando Belo
CIICESI, School of Management and Technology, Polytechnic of Porto, Felgueiras, Portugal
Algoritmi R&D Centre, University of Minho, 4710-057 Braga, Portugal

Does the complexity of an XML structure have an influence on parsing speed?

From "parsing speed" point of view, how much influence(if any) has number of attributes and depth of XML document on parsing speed?
Is it better to use more elements or as many attributes as possible?
Is "deep" XML structure hard to read?
I am aware that if I would use more attributes, XML would be not so heavy and that adapting XML to parser is not right way to create XML file
thanks
I think it depends on whether you are doing validation or not. If you are validating against a large and complex schema, then proportionately more time is likely to be spent doing the validation ... than for a simple schema.
For non-validating parsers, the complexity of the schema probably doesn't matter much. The performance will be dominated by the size of the XML.
And of course performance also depends the kind of parser you are using. A DOM parser will generally be slower because you have to build a complete in-memory representation before you start. With a SAX parser, you can just cherry-pick the parts you need.
Note however that my answer is based on intuition. I'm not aware of anyone having tried to measure the effects of XML complexity on performance in a scientific fashion. For a start, it is difficult to actually characterize XML complexity. And people are generally more interested in comparing parsers for a given sample XML than in teasing out whether input complexity is a factor.
Performance is a property of an implementation. Different parsers are different. Don't try to get theoretical answers about performance, just measure it.
Is it better to use more elements or as many attributes as possible?
What has that got to do with performance of parsing? I find it very hard to believe that any difference in performance will justify distorting your XML design. On the contrary, using a distorted XML design in the belief that it will improve parsing speed will almost certainly end up giving you large extra costs in the applications that generate and consume the XML.
If you are using a SAX parser, it does not matter whether the XML is large or not, as it is a top-down parser and does not hold the full XML in memory. For DOM it does matter, as it holds the full XML in memory. You can get some idea of how XML parsers compare in my blog post here.

Best way to input files to XPath

I'm using XPath to read XML files. The size of a file is unknown (between 700 KB and 2 MB) and I have to read around 100 files per second. So I want a fast way to load the files and read them with XPath.
I tried to use Java NIO file channels and memory-mapped files, but they were hard to use with XPath.
So can someone suggest a way to do it?
A lot depends on what the XPath expressions are doing. There are four costs here: basic I/O to read the files, XML parsing, tree building, and XPath evaluation. (Plus a possible fifth, generating the output, but you haven't mentioned what the output might be.) From your description we have no way of knowing which factor is dominant. The first step in performance improvement is always measurement, and my first step would be to try and measure the contribution of these four factors.
If you're in an environment with multiple processors (and who isn't?) then parallel execution would make sense. You may get this "for free" if you can organize the processing using the collection() function in Saxon-EE.
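If Saxon isn't an option, here is a rough sketch of the same per-file parallelism using plain JAXP XPath and a Java parallel stream rather than collection(); the "xml-input" directory and the XPath expression are placeholders.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class ParallelXPathScan {

    // JAXP XPath objects are not thread-safe, so give each worker its own compiled copy.
    private static final ThreadLocal<XPathExpression> EXPR = ThreadLocal.withInitial(() -> {
        try {
            // Placeholder expression; substitute whatever you actually evaluate.
            return XPathFactory.newInstance().newXPath().compile("//item/@id");
        } catch (XPathExpressionException e) {
            throw new RuntimeException(e);
        }
    });

    public static void main(String[] args) throws Exception {
        List<Path> files;
        try (Stream<Path> s = Files.list(Paths.get("xml-input"))) {
            files = s.filter(p -> p.toString().endsWith(".xml"))
                     .collect(Collectors.toList());
        }

        // parallelStream() spreads parsing and evaluation across the common fork/join pool.
        files.parallelStream().forEach(p -> {
            try {
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(p.toFile());
                System.out.println(p + " -> " + EXPR.get().evaluate(doc));
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }
}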
If I were you, I would probably drop Java in this case altogether, not because you can't do it in Java, but because using a bash script (if you are on Unix) is going to be faster; at least, that is what my experience dealing with lots of files tells me.
On *nix you have the utility called xpath exactly for that.
Since you are doing lots of I/O operations, having a decent SSD would help far more than doing it in separate threads. You still need to do it with multiple threads, but not more than one per CPU.
If you want performance, I would simply drop XPath altogether and use a SAX parser to read the files. You can search Stack Overflow for SAX vs XPath vs DOM questions to get more details. Here is one: Is XPath much more efficient as compared to DOM and SAX?

Parsing binary data in Java - high volume, single thread

I need to parse (and transform and write) a large binary file (larger than memory) in Java. I also need to do so as efficiently as possible in a single thread. And, finally, the format being read is very structured, so it would be good to have some kind of parser library (so that the code is close to the complex specification).
The amount of lookahead needed for parsing should be small, if that matters.
So my questions are:
How important is NIO vs. IO for a single-threaded, high-volume application?
Are there any good parser libraries for binary data?
How well do parsers support streaming transformations (I want to be able to stream the data being parsed to some output during parsing - I don't want to have to construct an entire parse tree in memory before writing things out)?
On the NIO front, my suspicion is that NIO isn't going to help much, as I am likely disk-limited (and since it's a single thread, there's no loss in simply blocking). Also, I suspect IO-based parsers are more common.
Let me try to explain if and how Preon addresses all of the concerns you mention:
I need to parse (and transform and write) a large binary file (larger
than memory) in Java.
That's exactly why Preon was created. You want to be able to process the entire file, without loading it into memory entirely. Still, the program model gives you a pointer to a data structure that appears to be in memory entirely. However, Preon will try to load data as lazily as it can.
To explain what that means, imagine that somewhere in your data structure, you have a collection of things that are encoded in a binary representation with a constant size; say that every element will be encoded in 20 bytes. Then Preon will first of all not load that collection in memory at all, and if you're grabbing data beyond that collection, it will never touch that region of your encoded representation at all. However, if you would pick the 300th element of that collection, it would (instead of decoding all elements up to the 300th element), calculate the offset for that element, and jump there immediately.
From the outside, it is as though you have a reference to a list that is fully populated. From the inside, it only goes out to grab an element of the list if you ask for it. (And forget about it immediately afterward, unless you instruct Preon to do things differently.)
I also need to do so as efficiently as possible in a single thread.
I'm not sure what you mean by efficiently. It could mean efficiently in terms of memory consumption, or efficiently in terms of disk IO, or perhaps you mean it should be really fast. I think it's fair to say that Preon aims to strike a balance between an easy programming model, memory use and a number of other concerns. If you really need to traverse all data in a sequential way, then perhaps there are ways that are more efficient in terms of computational resources, but I think that would come at the cost of "ease of programming".
And, finally, the format being read is very structured, so it would be
good to have some kind of parser library (so that the code is close to
the complex specification).
The way I implemented support for Java byte code was to just read the byte code specification and then map all of the structures mentioned there directly to Java classes with annotations. I think Preon comes pretty close to what you're looking for.
You might also want to check out preon-emitter, since it allows you to generate annotated hexdumps (such as in this example of the hexdump of a Java class file) of your data, a capability that I haven't seen in any other library. (Hint: make sure you hover with your mouse over the hex numbers.)
The same goes for the documentation it generates. The aim has always been to make sure it creates documentation that could be posted to Wikipedia, just like that. It may not be perfect yet, but I'm not unhappy with what it's currently capable of doing. (For an example: this is the documentation generated for Java's class file specification.)
The amount of lookahead needed for parsing should be small, if that matters.
Okay, that's good. In fact, that's even vital for Preon. Preon doesn't support lookahead. It does support looking back, though. (That is, sometimes part of the encoding mechanism is driven by data that was read before. Preon allows you to declare dependencies that point back to data read before.)
Are there any good parser libraries for binary data?
Preon! ;-)
How well do parsers support streaming transformations (I want to be
able to stream the data being parsed to some output during parsing - I
don't want to have to construct an entire parse tree in memory before
writing things out)?
As I outlined above, Preon does not construct the entire data structure in memory before you can start processing it. So, in that sense, you're good. However, there is nothing in Preon supporting transformations as first-class citizens, and its support for encoding is limited.
On the nio front my suspicion is that nio isn't going to help much, as
I am likely disk limited (and since it's a single thread, there's no
loss in simply blocking). Also, I suspect io-based parsers are more
common.
Preon uses NIO, but only its support for memory-mapped files.
On NIO vs IO you are right; going with IO should be the right choice - less complexity, stream-oriented, etc.
For a binary parsing library - checkout Preon
Using a memory-mapped file, you can read through it without worrying about your memory, and it's fast.
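A minimal sketch of reading through a memory-mapped file with NIO; the file name is a placeholder, and note that a single mapping is limited to 2 GB, so very large files have to be mapped in chunks.

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(
                Paths.get("data.bin"), StandardOpenOption.READ)) {
            // Map the file read-only; the OS pages it in on demand, so the heap stays small.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long checksum = 0;
            while (buf.hasRemaining()) {
                checksum += buf.get();   // stand-in for feeding bytes (or slices) to a parser
            }
            System.out.println("Scanned " + channel.size() + " bytes, checksum " + checksum);
        }
    }
}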
I think you are correct re NIO vs IO, unless you have little-endian data, as NIO can read little-endian values natively.
I am not aware of any fast binary parsers; generally you want to call NIO or IO directly.
Memory-mapped files can help with writing from a single thread, as you don't have to flush as you write. (But they can be more cumbersome to use.)
You can stream the data however you like; I don't foresee any problems.

Alternative to XSLT?

On my project I have a huge XSLT stylesheet used to convert some XML files to HTML.
The problem is that this file is growing day by day; it's hard to read, debug and test.
So I was thinking about moving the whole parsing process to Java.
Do you think this is a good idea? If so, what libraries for parsing XML and generating HTML (XML) do you suggest? Will performance be better or worse?
If it's not a good idea, is there any alternative?
Thanks
Randomize
Take a look at CDuce - it is a strictly typed, statically compiled XML processing language.
I once had a client with a similar problem - thousands of lines of XSLT, growing all the time. I spent an hour reading it with increasing incredulity, then rewrote it in 20 lines of XSLT.
Refactoring is often a good idea, and the worse the code is, the more worthwhile refactoring is. But there's no reason to believe that just because the code is bad and in need of refactoring, you need to change to a different programming language. XSLT is actually very good at handling variety and complexity if you know how to use it properly.
It's possible that the code is an accumulation of special handling of special cases, and each new special case discovered results in more rules being added. That's a tough problem to tackle in any language, but XSLT can deal with it better than most, provided you apply your mind all the time to finding abstract general rules that encompass all the special rules, so you only need to code the special rules as exceptions.
I'd consider Velocity as an alternative. I prefer it to XSL-T. The transforms are harder to write than templates, because the latter look exactly like the XML I wish to produce. It's a simple thing to add in the markup to map in the data.
