Java XML parsing DOM performance - java

I'm part of a team creating a data store that passes information around in large XML documents (herein called messages). On the back end, the messages get shredded apart and stored in Accumulo in pieces. When a caller requests data, the pieces get reassembled into a message tailored for the caller. The schemas are somewhat complicated, so we couldn't use JAXB out of the box. The team (this was a few years ago) assumed that DOM wasn't performant. We're now buried in layer after layer of half-broken parsing code that will take months to finish, will break the second someone changes the schema, and is making me want to jam a soldering iron into my eyeball. As far as I can tell, if we switch to DOM, a lot of this fart code can be cut and the code base will be more resilient to future changes. My team lead is telling me that there's a performance hit in using DOM, but I can't find any data validating that assumption that isn't from 2006 or earlier.
Is parsing large XML documents via DOM still sufficiently slow to warrant all the pain that XMLBeans is causing us?
edit 1 In response to some of your comments:
1) This is a government project so I can't get rid of the XML part (as much as I really want to).
2) The issue with JAXB, as I understand it, had to do with the substitution groups present in our schemas. Perhaps I should restate the JAXB issue as one of the ratio of effort to return in using it.
3) What I'm looking for is some kind of recent data supporting/disproving the contention that using XMLBeans is worth the pain we're going through writing a bazillion lines of brittle binding code because it gives us an edge in terms of performance. Something like Joox looks so much easier to deal with, and I'm pretty sure we can still validate the result after the server has reassembled a shredded message before sending it back to the caller.
So does anyone out there in SO land know of any data germane to this issue that's no more than five years old?

Data binding solutions like XMLBeans can perform very well, but in my experience they can become quite unmanageable if the schema is complex or changes frequently.
If you're considering a tree model, then don't use DOM itself; use one of the other tree-based XML models such as JDOM2 or XOM. They are much better designed.
Better still (but it's probably too radical a step given where you are starting) don't process your XML data in Java at all, but use an XRX architecture where you use XML-based technologies end-to-end: XProc, XForms, XQuery, XSLT.
I think from your description that you need to focus on cleaning up your application architecture rather than on performance. Once you've cleaned it up, performance investigation and tuning will be vastly easier.
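Since the question asks for recent data and none is cited here, one option is to measure DOM parsing on your own messages. The following is only a minimal sketch using the JDK's built-in `javax.xml.parsers` API; the class name `DomParseTimer`, the synthetic `<message>`/`<item>` document and the item count are all made up for illustration, and a serious comparison would use real messages and a proper benchmark harness rather than a hand-rolled timer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class DomParseTimer {

    /** Builds a synthetic message (hypothetical shape) with the given number of <item> elements. */
    static String syntheticMessage(int items) {
        StringBuilder sb = new StringBuilder("<message>");
        for (int i = 0; i < items; i++) {
            sb.append("<item id=\"").append(i).append("\">value ").append(i).append("</item>");
        }
        return sb.append("</message>").toString();
    }

    /** Parses the XML into a DOM tree and returns the number of <item> elements found. */
    static int countItems(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("item").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = syntheticMessage(100_000);
        // Warm up the JIT first; a single cold run is not representative.
        for (int i = 0; i < 5; i++) {
            countItems(xml);
        }
        long start = System.nanoTime();
        int items = countItems(xml);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(items + " items parsed in " + elapsedMs + " ms");
    }
}
```

Swapping `syntheticMessage` for one of your real messages gives a number you can put next to the existing XMLBeans path.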

If you want the best technology for heavy duty XML processing, you might want to investigate this paper. The best technology will no doubt be clear after you read it...
Paper details:
Processing XML with Java – A Performance Benchmark
Bruno Oliveira (1), Vasco Santos (1) and Orlando Belo (2)
(1) CIICESI, School of Management and Technology, Polytechnic of Porto, Felgueiras, PORTUGAL
(2) Algoritmi R&D Centre, University of Minho, 4710-057 Braga, PORTUGAL

Related

Xml parsing and writing txt file using multithread in java

I have many XML files. Every XML file contains many lines and tags. I must parse them and write a .txt file named after each XML file. This needs to be done quickly; the faster the better.
example of xml file:
<text>
<paragraph>
<line>
<character>g</character>
<character>o</character>
.....
</line>
<line>
<character>k</character>
.....
</line>
</paragraph>
</text>
<text>
<paragraph>
<line>
<character>c</character>
.....
</line>
</paragraph>
</text>
example of text file:
go..
k..
c..
How can I parse many xml files and write many text files using multi thread in java as fast as I can?
Where should I start to solve the problem? Does the method I use to parse affect speed? If so, which method is faster than the others?
I have no experience with multithreading. How should I build a multi-threaded structure so that it is effective?
Any help is appreciated. Thanks in advance.
EDIT
I need some help. I used SAX for parsing. I did some research on thread pools, multithreading and Java 8 features. I tried some code blocks but there was no change in total time. How can I add a multi-threaded structure or Java 8 features (lambda expressions, parallelism, etc.) to my code?
Points to note in this situation.
In many cases, attempting to write to multiple files at once using multi-threading is utterly pointless. All this generally does is exercise the disk heads more than necessary.
Writing to disk while parsing is also likely to be a bottleneck. You would be better off parsing the XML into a buffer and then writing the whole buffer to disk in one hit.
The speed of your parser is unlikely to affect the overall time for the process significantly. Your system will almost certainly spend much more time reading and writing than parsing.
A quick check with some real test data would be invaluable. Try to get a good estimate of the amount of time you will not be able to affect.
Determine an approximate total read time by reading a few thousand sample files into memory because that time will still need to be taken however parallel you make the process.
Estimate an approximate total write time in a similar way.
Add the two together and compare that with your total execution time for reading, parsing and writing those same files. This should give you a good idea how much time you might save through parallelism.
Parallelism is not always an answer to slow-running processes. You can often significantly improve throughput just by using appropriate hardware.
First, are you sure you need this to be faster or multithreaded? Premature optimization is the root of all evil. You can easily make your program much more complicated for unimportant gain if you aren't careful, and multithreading can for sure make things much more complicated.
However, toward the actual question:
Start out by solving this in a single-threaded way. Then think about how you want to split this problem across many threads. (e.g. have a pool of xml files and threads, and each thread grabs an xml file whenever its free, until the pool is empty) Report back with wherever you get stuck in this process.
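The "pool of files and threads" idea can be sketched with a fixed-size `ExecutorService`. This is a rough sketch, not a tuned implementation: the class name `FilePool`, the helpers `runInPool` and `convertFile`, and the identity transform used in `main` are all hypothetical, and `Files.readString`/`Files.writeString` assume Java 11 or later. Each task reads one file, builds the whole output in memory, and writes it in one go.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FilePool {

    /** Runs the task once per input on a fixed-size pool; results come back in input order. */
    static <T, R> List<R> runInPool(List<T> inputs, Function<T, R> task, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<R> job = () -> task.apply(input);
                futures.add(pool.submit(job));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Reads one XML file, transforms it in memory, and writes foo.txt next to foo.xml. */
    static Path convertFile(Path xmlFile, Function<String, String> extractText) {
        try {
            String xml = Files.readString(xmlFile);
            String name = xmlFile.getFileName().toString().replaceFirst("\\.xml$", "") + ".txt";
            Path out = xmlFile.resolveSibling(name);
            Files.writeString(out, extractText.apply(xml)); // single write per file
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> xmlFiles;
        try (Stream<Path> listing = Files.list(Path.of(args[0]))) {
            xmlFiles = listing.filter(p -> p.toString().endsWith(".xml"))
                              .collect(Collectors.toList());
        }
        int threads = Runtime.getRuntime().availableProcessors();
        // Placeholder transform (identity); substitute your real XML-to-text parsing here.
        runInPool(xmlFiles, file -> convertFile(file, xml -> xml), threads);
    }
}
```

Whether this is any faster than the single-threaded version depends on how much of the run is disk time, so time both before keeping the extra complexity.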
The method that you use to parse will affect speed, as different parsing libraries have different behavior characteristics. But again, are you sure you need the absolute fastest?
If you write your code in XSLT (2.0 or later), using the collection() function to parse your source files, and the xsl:result-document instruction to write your result files, then you will be able to assess the effect of multi-threading simply by running the code under Saxon-EE, which applies multi-threading to these constructs automatically. Usually in my experience this gives a speed-up of around a factor of 3 for such programs.
This is one of the benefits of using functional declarative languages: because there is no mutable state, multi-threading is painless.
LATER
I'll add an answer to your supplementary question about using DOM or SAX. From what we can see, the output file is a concatenation of the <character> elements in the input, so if you wrote it in XSLT 3.0 it would be something like this:
<xsl:mode on-no-match="shallow-skip"/>
<xsl:template match="character">
<xsl:value-of select="."/>
</xsl:template>
If that's the case then there's certainly no need to build a tree representation of each input document, and coding it in SAX would be reasonably easy. Or if you follow my suggestion of using Saxon-EE, you could make the transformation streamable to avoid the tree building. Whether this is useful, however, really depends on how big the source documents are. You haven't given us any numbers to work with, so giving concrete advice on performance is almost impossible.
If you are going to use a tree-based representation, then DOM is the worst one you could choose. It's one of those cases where there are half-a-dozen better alternatives but because they are only 20% better, most of the world still uses DOM, perceiving it to be more "standard". I would choose XOM or JDOM2.
If you're prepared to spend an unlimited amount of time coding this in order to get the last ounce of execution speed, then SAX is the way to go. For most projects, however, programmers are expensive and computers are cheap, so this is the wrong trade-off.
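For comparison, the SAX route for this particular job could look roughly like the sketch below. It assumes each file is a well-formed document with a single root (the sample in the question shows two `<text>` elements, which would need a wrapper), and that each `<line>` becomes one line of output; the class name `CharacterExtractor` and the static helper `extract` are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class CharacterExtractor extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean inCharacter;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("character".equals(qName)) {
            inCharacter = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text inside <character> is kept; whitespace between elements is ignored.
        if (inCharacter) {
            out.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("character".equals(qName)) {
            inCharacter = false;
        } else if ("line".equals(qName)) {
            out.append('\n'); // assumption: one output line per <line> element
        }
    }

    /** Parses the XML with SAX and returns the concatenated characters, one line per <line>. */
    static String extract(String xml) {
        try {
            CharacterExtractor handler = new CharacterExtractor();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }
}
```

No tree is built, so memory use stays flat regardless of document size, at the cost of tracking state by hand.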

Convert asterisks to bold or italic tags

I want to convert text between markdown style bold/italics to html bold/italics. Here's an example:
**Bold text** is bold, *italic* text is italicized.
Should go to:
<b>Bold text</b> is bold, <i>italic</i> text is italicized.
I looked elsewhere on SO, but most questions recommended a parsing library. However, I think using a library would be unsuitable for the following reasons:
I'm trying to keep the code base as small as possible
A parser will have too many features!
I want to make it as fast & lightweight as possible
How should I go about converting these tags then?
I have tried to do this myself in the past, thinking exactly as you have and trying to hand-roll the solution. The number of exceptions you have to cater for once you add one or two more markups becomes very large. I ended up re-inventing the wheel in a much less efficient manner. I opted to adopt one of the parsing libraries and never looked back.
A parser will have too many features!
You can get some parsers that let you define your own markup language. This is what I opted for. I did it in .Net so I can't suggest a Java version.
I want to make it as fast & lightweight as possible
Any parsing library will probably be more efficient than your own, and unless you're parsing many MBs of data I don't think you'll notice much difference. The library authors have usually spent much more time on making it efficient than you or I would be willing to.
I know this isn't an "answer" as such, but I hope I save you some time (and delay the onset of gray hair) or point you in the right direction.
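For reference, if the input really is limited to the two markers in the question, a hand-rolled version can be sketched with two regular expressions. This is only a sketch under that assumption: the class name `AsteriskMarkup` is made up, and it does not handle escaped asterisks, nesting, unbalanced markers, or HTML-escaping of the text, which is exactly where the complexity described above starts.

```java
import java.util.regex.Pattern;

final class AsteriskMarkup {
    // Bold must be replaced first, otherwise "**" would be consumed as two italic markers.
    private static final Pattern BOLD = Pattern.compile("\\*\\*(.+?)\\*\\*");
    private static final Pattern ITALIC = Pattern.compile("\\*(.+?)\\*");

    /** Converts **bold** and *italic* spans to <b> and <i> tags. */
    static String toHtml(String text) {
        String withBold = BOLD.matcher(text).replaceAll("<b>$1</b>");
        return ITALIC.matcher(withBold).replaceAll("<i>$1</i>");
    }

    public static void main(String[] args) {
        System.out.println(toHtml("**Bold text** is bold, *italic* text is italicized."));
        // prints: <b>Bold text</b> is bold, <i>italic</i> text is italicized.
    }
}
```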

Alternative to XSLT?

On my project I have a huuuuge XSLT used to convert some XML files to HTML.
The problem is that this file is growing day by day, and it's hard to read, debug and test.
So I was thinking about moving all the parsing to Java.
Do you think this is a good idea? If so, which libraries for parsing XML and generating HTML (XML) do you suggest? Will performance be better or worse?
If it's not a good idea, is there any alternative?
Thanks
Randomize
Take a look at CDuce - it is a strictly typed, statically compiled XML processing language.
I once had a client with a similar problem - thousands of lines of XSLT, growing all the time. I spent an hour reading it with increasing incredulity, then rewrote it in 20 lines of XSLT.
Refactoring is often a good idea, and the worse the code is, the more worthwhile refactoring is. But there's no reason to believe that just because the code is bad and in need of refactoring, you need to change to a different programming language. XSLT is actually very good at handling variety and complexity if you know how to use it properly.
It's possible that the code is an accumulation of special handling of special cases, and each new special case discovered results in more rules being added. That's a tough problem to tackle in any language, but XSLT can deal with it better than most, provided you apply your mind all the time to finding abstract general rules that encompass all the special rules, so you only need to code the special rules as exceptions.
I'd consider Velocity as an alternative. I prefer it to XSL-T. The transforms are harder to write than templates, because the latter look exactly like the XML I wish to produce. It's a simple thing to add in the markup to map in the data.

is creating a unique html file for each article a good practice?

Sorry for the poor topic name; I could not think of anything better ;)
I am working on a news broadcast web site project, and the stakeholder asked me to create a unique HTML file for each article and save it on disk instead of using a DBMS like MySQL, so that users can access the file directly and no computation is needed, and so there won't be any bottleneck.
And I did so.
My question is: is this (what he asked me to do) a good and popular practice in programming?
What are the pros and cons?
Thank you all, and sorry for my poor English writing :P
If you have a template and can generate these pages automatically, it can be a good practice. As you say, it saves your server from having to generate the page on each request. It only needs to serve the plain page.
And if you need to change the layout, or need to edit an article, you can just regenerate the page.
It is quite common, although lots of pages always have some dynamic content, like a date, user info or other session or time specific data. In this case you cannot cache the entire page. Of course you can combine both. Have dynamic index pages and front page, and only cache the actual articles themselves. But I read in your question that that is what you've done now.
Pros:
Faster retrieval of pages
Less load on your webserver
Less load on your database server
Cons:
Need to do some extra work to update the cache when an article is modified
Cannot have any dynamic content in the page
There probably isn't a problem at all. Most webservers are able to serve large numbers of dynamic pages (premature optimization is the root of all evil).
There are other ways to speed things up, that don't have the above cons. You could cache query results in Memcache and/or use APC cache to speed up your PHP code and decrease disk I/O.
But there are web hosting companies dedicated entirely to serving static content. That static content can be served from memory too, making it even faster than APC-cached dynamic content, so if you really really really need the performance, yes, this is the way to go. But I seriously doubt you do.
Static pages are good for small websites. If you have the chance, go for it but if you need complex operations, dynamic page structure should be the way to go.
For an article site, I'd go with dynamic pages since the concept is dynamic (You'll need to update the site, add new articles, maybe add new features like commenting, user activity etc).
It is easier to add, delete or edit an article directly from an admin panel; with static pages, you'd have to find your way through the HTML code.
The list would go on and on...
Without a half-decent templating system, you'd have to store the full article AND the page layout and styles in the one file.
This means it'd be difficult to update the look and feel across all the published articles, and if you wanted to query the articles and return a list (such as those from a specific author or in a specific category), you'd be a bit stuck too.
If you think of it as a replacement for your database: No, that's not good practice. You lose a lot of information, and editing pages later will be harder, as will setting up indexed search functions,...
If you think of it as a caching solution: Then yes, this is good practice and also a common technique. But think on how to do the caching, when to replace the files with new versions and only do it if you have few write accesses and a lot of read accesses to your pages (which is typical for an article site ^^)
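To make the caching variant concrete, here is a small sketch in Java of regenerating one static file whenever an article is saved, with the database remaining the source of truth. Everything named here is hypothetical (the `ArticleCache` class, the `{{title}}`/`{{body}}` placeholders, the slug-based file name), the title is not HTML-escaped, and `Files.writeString` assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ArticleCache {
    // Hypothetical page template; a real site would load this from its templating system.
    static final String TEMPLATE =
            "<html><head><title>{{title}}</title></head>"
            + "<body><h1>{{title}}</h1>{{body}}</body></html>";

    /** Fills the template with the article's title and (already safe) body HTML. */
    static String render(String title, String bodyHtml) {
        return TEMPLATE.replace("{{title}}", title).replace("{{body}}", bodyHtml);
    }

    /** Writes slug.html into cacheDir; call this whenever the article is created or edited. */
    static Path publish(Path cacheDir, String slug, String title, String bodyHtml) {
        try {
            Files.createDirectories(cacheDir);
            // Write to a temp file first so readers never see a half-written page.
            Path tmp = Files.createTempFile(cacheDir, slug, ".tmp");
            Files.writeString(tmp, render(title, bodyHtml));
            return Files.move(tmp, cacheDir.resolve(slug + ".html"),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Changing the layout then means editing the template and calling `publish` again for every article, which is the "extra work to update the cache" trade-off.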
Definitely not a common practice, and I would not do it this way. Especially not for the sake of avoiding a bottleneck: you won't have any bottleneck there, nor any performance problem. How many unique visitors is your site likely to be getting? Hundreds of thousands?
In fact, reading from the disk is more likely to be a problem. DB operations can be optimized, cached in memory, etc - the db server performs various optimizations. On the other hand, you read the file each time (or handle the caching yourself).
The usual and preferred way to do it is:
store and load content from DB
have a template (header + footer) for the page, and only insert the content
have an admin panel with an editor (as rich as possible) where you can modify the content of the article
I started out asking myself why a stakeholder might be asking you to implement a system this way. Why would he / she care, as long as your system meets the requirements? There are two possible answers to this:
The stakeholder is a bit of a control freak; e.g. an ex-techie who likes to interfere with what his developers do.
The stakeholder has had a bad experience in the past; e.g. with a previous system where the content was "locked into" a database with an unwieldy front end that made life hell for the users.
From this standpoint, how would you address the problem? My take is that you need to get to the bottom of why the stakeholder is asking for this. Does he have some genuine concern? Can you address that concern in the system design?
The bottom line is that "is this best practice" is not the overriding criterion here. Arguably, "what the customer wants" or "what the customer needs" are more important.
What I think you need to do is:
Find out what the stakeholder's real concern is.
Discuss with him / her (and other stakeholders) the design options that will address those concerns. Present them with the alternatives and an honest assessment of their implications, and involve them in the decision making.

Best practices in internationalizing text with lots of markup?

I'm working on a web project that will (hopefully) be available in several languages one day (I say "hopefully" because while we only have an English language site planned today, other products of my company are multilingual and I am hoping we are successful enough to need that too).
I understand that the best practice (I'm using Java, Spring MVC, and Velocity here) is to put all text that the user will see in external files, and refer to them in the UI files by name, such as:
#in messages_en.properties:
welcome.header = Welcome to AppName!
#in the markup
<title>#springMessage("welcome.header")</title>
But, having never had to go through this process on a project myself before, I'm curious what the best way to deal with this is when you have some segments of the UI that are heavy on markup, such as:
<p>We are excited to announce that Company1 has been acquired by
Division X,
a fast-growing division of Company 2, Inc.
(Nasdaq: BLAH), based in...
One option I can think of would be to store this "low-level" of markup in messages.properties itself for the message - but this seems like the worst possible option.
Other options that I can think of are:
Store each non-markup inner fragment in messages.properties, such as acquisitionAnnounce1, acquisitionAnnounce2, acquisitionAnnounce3. This seems very tedious though.
Break this message into more reusable components, such as Company1.name, Company2.name, Company2.ticker, etc., as each of these is likely reused in many other messages. This would probably account for 80% of the words in this particular message.
Are there any best practices for dealing with internationalizing text that is heavy with markup such as this? Do you just have to bite down and bear the pain of breaking up every piece of text? What is the best solution from any projects you've personally dealt with?
Typically if you use a template engine such as Sitemesh or Velocity you can manage these smaller HTML building blocks as subtemplates more effectively.
By so doing, you can incrementally boil down the strings which are the purely internationalized ones into groups and make them relevant to those markup subtemplates. Having done this sort of work using templates for an app which spanned multi-languages in the same locale, as well as multiple locales, we never ever placed markup in our message bundles.
I'd suggest that a key good practice would be to avoid placing markup (even at a low-level as you put it) inside message properties files at all costs! The potential this has for unleashing hell is not something to be overlooked - biting the bullet and breaking things up correctly is far less of a pain than having to manage many files with scattered HTML markup. It's important you can visualise markup as holistic chunks, and scattering that everywhere would make everyday development a chore since:
You would lose IDE color highlighting and syntax validation
High possibility that one locale file or another can easily be missed when changes to designs / markup filter down
Breaking things down (to a realistic point, eg logical sentence structures but no finer) is somewhat hard work upfront but worth the effort.
Regarding string breakdown granularity, here's a sample of what we did:
comment.atom-details=Subscribe To Comments
comment.username-mandatory=You must supply your name
comment.useremail-mandatory=You must supply your email address
comment.email.notification=Dear {0}, the comment thread you are watching has been updated.
comment.feed.title=Comments on {0}
comment.feed.title.default=Comments
comment.feed.entry.title=Comment on {0} at {1,date,medium} {2,time,HH:mm} by {3}
comment.atom-details=Suscribir a Comentarios
comment.username-mandatory=Debes indicar tu nombre
comment.useremail-mandatory=Debes indicar tu direcci\u00f3n de correo electr\u00f3nico
comment.email.notification=La conversaci\u00f3n que estas viendo ha sido actualizada
comment.feed.title=Comentarios sobre {0}
comment.feed.title.default=Comentarios
comment.feed.entry.title=Comentarios sobre {0} a {1,date,medium} {2,time,HH:mm} por {3}
So you can do interesting things with how you do string replacement in the message bundle, which may also help you preserve its logical meaning while still allowing you to manipulate it mid-sentence.
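Outside of Spring's `#springMessage`, the same `{0}` substitution can be tried in plain Java with `java.text.MessageFormat`. The sketch below loads two of the entries above from in-memory strings just to stay self-contained; in a real application they would live in `messages_en.properties` and `messages_es.properties`. The class name `CommentMessages` and its helper methods are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class CommentMessages {

    /** Builds a bundle from properties-file text (stand-in for a real .properties file). */
    static ResourceBundle bundleOf(String properties) {
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the pattern for the key and fills its {0}, {1}, ... placeholders. */
    static String format(ResourceBundle bundle, Locale locale, String key, Object... args) {
        return new MessageFormat(bundle.getString(key), locale).format(args);
    }

    public static void main(String[] args) {
        ResourceBundle en = bundleOf("comment.feed.title=Comments on {0}\n"
                + "comment.feed.title.default=Comments\n");
        ResourceBundle es = bundleOf("comment.feed.title=Comentarios sobre {0}\n"
                + "comment.feed.title.default=Comentarios\n");

        System.out.println(format(en, Locale.ENGLISH, "comment.feed.title", "My Post"));
        System.out.println(format(es, new Locale("es"), "comment.feed.title", "My Post"));
    }
}
```

Because the whole sentence stays in one entry, each translation is free to put `{0}` wherever its grammar needs it.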
As others have said, please never split the strings into segments. You will cause translators grief as they have to coerce their language syntax to the ad-hoc rules you inadvertently create. Often it will not be possible to provide a grammatically correct translation, especially if you reuse certain segments in different contexts.
Do not remove the markup, either.
Please do not assume professional translators work in Notepad :) Computer-aided translation (CAT) tools, such as the Trados suite, know about markup perfectly well. If the tagging is HTML, rather than some custom XML format, no special preparation is required. Trados will protect the tags from accidental modification, while still allowing changes where necessary. Note that certain elements of tags often need to be localized, e.g. alt text or some query strings, so just stripping all the markup won't do.
Best of all, unless you're working on a zero-budget personal project, consider contacting a localization vendor. Localization is a service just like web design. A competent vendor will help you pick the optimal solution/format for your project and guide you through the preparation of the source material and incorporating the localized result. And of course they and their translators will have all the necessary tools. (Full disclosure: I am a translator / localization specialist. And don't split up strings :)
First off, don't split up your strings. This makes it much harder for localizers to translate text because they can't see the entire string to translate.
I would probably try to use placeholders around the links:
Division X
That's how I did it when I was localizing a site into 30 languages. It's not perfect, but it works.
I don't think it's possible (or easy) to remove all markup from strings, you need to have a way to insert the urls and any extra markup.
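A sketch of what "placeholders around the links" might look like in Java follows. The original example string is not fully visible here, so the `{0}`/`{1}` convention for the opening and closing anchor tags, the `/division-x` URL and the class name `LinkPlaceholders` are all assumptions. The translator sees one complete sentence and only has to keep the two markers around the right words.

```java
import java.text.MessageFormat;

final class LinkPlaceholders {
    // Hypothetical bundle entry: {0} and {1} stand for the opening and closing link tags.
    static final String PATTERN =
            "Company1 has been acquired by {0}Division X{1}, "
            + "a fast-growing division of Company 2, Inc.";

    /** Wraps the marked words in a link to the given URL. */
    static String withLink(String pattern, String url) {
        return MessageFormat.format(pattern, "<a href=\"" + url + "\">", "</a>");
    }

    public static void main(String[] args) {
        System.out.println(withLink(PATTERN, "/division-x"));
    }
}
```

Note that `MessageFormat` treats a single quote as an escape character, so patterns containing apostrophes need them doubled.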
You should avoid breaking up your strings. Not only does this become a nightmare to translate, but it also makes grammatical assumptions which may not be correct in the target language.
While placeholders can be helpful for many things, I would not recommend using placeholders for URLs. Keeping the URL inside the translatable string allows you to customize the URL for different locales. After all, there's no sense sending users to an English-language page when their locale is Argentine Spanish!
