XSLT: extracting all possible input XPaths - Java

I have a really complex XSL file (20,000 lines!) which processes an input XML file and produces an output XML file.
I would like to extract all the input XPaths that the XSL can possibly process, that is, every (input) XPath that is referenced in the XSL.
For that purpose I would like to find a Java API or tool that can give me this information in a single pass over the XSL file. Are there such APIs or tools?
Regards
Aurel

Technically, that's easy: every node is processed in some way. I'm guessing what you really want is "what's specifically processed by a template that's defined in the stylesheet."
To do that, just add the identity template to the very bottom of your stylesheet. Anything which hasn't already been processed will be processed by that, and you can add an <xsl:message> instruction to output whatever you need there. I'm fairly sure you'll find examples of how to generate the XPath of the context node around; that's been asked a few times.
EDIT: I forgot to mention that technically this will give you the inverse of what you asked for: it'll tell you which nodes are NOT processed. Hopefully you'll be able to use that to achieve the same goal though.


Count all occurrences of xsl:result-document in a given stylesheet

I have an issue with part of my application in which I have some utility classes for xslt transformation functionality. I use SaxonHE as XSLT Transformer implementation.
My helper class has a function: URL mapFile(URL input, String stylesheetPath).
That takes the URL of one XML-File as input and returns a URL for the created XML-File. It handles the initialization and execution of the XSLT transformation.
But a stylesheet could theoretically create multiple XML files with xsl:result-document tags and I would like my utility class to be able to recognize if the given stylesheet will do that and handle it properly.
My idea was to analyse/parse the stylesheet from within my Java code and count all occurrences of xsl:result-document.
With the values of the href-attributes, I would also know where the stylesheet creates the output XML files since I want to return a URL that points to their location.
So my changed utility method would be: List<URL> mapFile(URL input, String stylesheetPath) and return a number of URLs based on how many files are created by the given stylesheet.
But I have no idea how to do this in Java code, and all my Google searches about counting elements in an XSL stylesheet resulted in explanations of how to count XML elements of the input XML from inside the stylesheet, which is not what I want to do.
EDIT: I ended up not doing any parsing of the stylesheet at all. I just create a folder, and if someone writes a stylesheet that doesn't put all result files in that folder then it is their fault if they don't get a URL back for that result document. A hacky solution, but it works for my use case.
For a single-module stylesheet it's very simple: just execute the XPath expression count(//xsl:result-document).
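The single-module case can be sketched with the JDK's built-in XPath API. This is only an illustration: the class and method names are made up, and it does not follow xsl:include or xsl:import.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; the names are illustrative only.
class ResultDocumentCounter {
    static final String XSL_NS = "http://www.w3.org/1999/XSL/Transform";

    /** Counts xsl:result-document instructions in one stylesheet module. */
    static int countResultDocuments(InputSource stylesheet) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed so the xsl: prefix can be matched by namespace
        Document doc = dbf.newDocumentBuilder().parse(stylesheet);

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the "xsl" prefix used in the XPath expression to the XSLT namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "xsl".equals(prefix) ? XSL_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return XSL_NS.equals(uri) ? "xsl" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        });

        Number n = (Number) xpath.evaluate(
                "count(//xsl:result-document)", doc, XPathConstants.NUMBER);
        return n.intValue();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path or URL of the stylesheet to inspect.
        System.out.println(countResultDocuments(new InputSource(args[0])));
    }
}
```

Note that this counts instructions, not the documents actually written at run time.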
For a stylesheet with multiple modules it gets more complicated because you have to follow xsl:include and xsl:import references, and more particularly, you have to detect cycles in the include/import graph so you don't go into an infinite loop.
You could export the stylesheet to a SEF file and execute count(//*:resultDoc) on the SEF file. Unfortunately that's Saxon-EE which will cost you money, but then writing the code by hand will cost you money too...
But actually you've asked for two different things. First you say you want to know the number of xsl:result-document instructions, then you say you want to know how many result documents are created. These aren't the same thing, because you don't know how often each xsl:result-document instruction is executed.
I suspect you can solve the problem by registering a result document handler with Saxon and using it to monitor calls on xsl:result-document at run-time.

What's the right way to produce XML content in Java?

I've read several questions and tutorials over the internet such as
Best XML parser for Java [closed]
JAVA XML - How do I get specific elements in an XML Node?
What is the best way to create XML files in Java?
how to modify xml tag specific value in java?
Using StAX - From Oracle Tutorials
Apache Xerces Docs
Introduction to XML and XML With Java
Java DOM Parser - Modify XML Document
But since this is the very first time I have to manipulate XML documents in Java, I'm still a little confused. The XML content is currently written with String concatenation, and that seems wrong to me. It is like concatenating Strings to produce a JSON object instead of using a JSONObject class. This is how the code is written right now:
"<msg:restenv xmlns:msg=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req req.xsd\"><autenticacao><usuario>"
+ usuario + "</usuario><senha>" + StringUtils.defaultIfBlank(UmbrellaRestClient.PARAMETROS_INFRA_UMBRELLA.get("SENHA_UMBRELLA"), "WS.INTEGRADOR")
+ "</senha></autenticacao><parametros><parametro><p_vl_gnre>" + valorGNRE + "</p_vl_gnre><p_cnpj_destinatario>" + cnpjFilial + "</p_cnpj_destinatario><p_num_ped_compra>" + idPedido
+ "</p_num_ped_compra><p_xml_sefaz><![CDATA[" + arquivoNfe + "]]></p_xml_sefaz></parametro></parametros></msg:restenv>"
From my research, almost everything I've read points to SAX as the best solution, but I never found anything really useful to read about it; almost everything states that we have to create a handler and override the startElement, endElement and characters methods.
I don't have to serialize the XML to disk, a database or anything else; I just need to pass its content to a web service.
So my question really is, which is the right way to do it?
Concatenate Strings the way things are done right now?
Write the XML file using a Java API like Xerces? I have no clue on how that can be done.
Read the XML file with streams and just change the node texts? My XML without the values would look like this:
<msg:restenv xmlns:msg="{url}"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="{schemaLocation}">
<autenticacao>
<usuario></usuario>
<senha></senha>
</autenticacao>
<parametros>
<parametro>
<p_vl_gnre></p_vl_gnre>
<p_cnpj_destinatario></p_cnpj_destinatario>
<p_num_ped_compra></p_num_ped_compra>
<p_xml_sefaz><![CDATA[]]></p_xml_sefaz>
</parametro>
</parametros>
</msg:restenv>
I've also read something about using Apache Velocity as a template engine, since I don't actually have to serialize the XML, and that's an approach I really like because I've already worked with this framework and it's really simple.
Again, I'm not looking for the best way, but the right one with tutorials and examples, if possible, on how to get things done.
It all depends on context. There is no single "right way".
The biggest issue with concatenation is the combination of escaping the XML into a String constant (which is fiddly) and also escaping the values that you are using so that they're correct for an XML document.
For small XMLs, that's fine.
But for larger ones, it can be a pain.
If most of your XML is boilerplate with just a few values inserted, you may find that templates using something like Velocity or any of the other several libraries may be quite effective. It helps keep the template "native" (you don't have to wrap it in "'s and escape it), plus it keeps the XML out of your code, but easily lets you stamp in the parts that you need to do.
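As a middle ground between concatenation and templates, the JDK's StAX XMLStreamWriter escapes inserted values itself. The sketch below builds a reduced version of the request from the question. The class name, method name and parameters are assumptions made for illustration, and it uses writeCharacters (escaped text) where the original used a CDATA section.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical builder; only part of the original message is reproduced.
class RequestXmlBuilder {
    static final String MSG_NS = "http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req";

    static String build(String usuario, String senha, String arquivoNfe)
            throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        w.writeStartDocument();
        w.writeStartElement("msg", "restenv", MSG_NS);
        w.writeNamespace("msg", MSG_NS);

        w.writeStartElement("autenticacao");
        writeTextElement(w, "usuario", usuario);
        w.writeStartElement("senha");
        w.writeCharacters(senha);
        w.writeEndElement(); // senha
        w.writeEndElement(); // autenticacao

        w.writeStartElement("parametros");
        w.writeStartElement("parametro");
        // The embedded document is written as escaped text instead of CDATA.
        writeTextElement(w, "p_xml_sefaz", arquivoNfe);
        w.writeEndElement(); // parametro
        w.writeEndElement(); // parametros

        w.writeEndDocument(); // closes msg:restenv
        w.close();
        return out.toString();
    }

    // Writes <name>text</name>; writeCharacters escapes '<' and '&' in the text.
    private static void writeTextElement(XMLStreamWriter w, String name, String text)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text);
        w.writeEndElement();
    }
}
```

The result is a String that can be passed straight to the web service, with no file involved.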
I agree that there's not just one way to do it, but I would advise you to take a look at JAXB. You can easily consume and produce XML without any of that pesky String manipulation. Look here for a simple intro: https://docs.oracle.com/javase/tutorial/jaxb/index.html
The Answer by Will Hartung is correct. There is not one right way as it depends on your situation.
For a beginner programmer, I suggest writing the strings manually so you get to understand XML in general and your content in particular. As for the mechanics of String concatenation, you would generally be using StringBuilder rather than String for better performance. Where thread-safety is needed, use StringBuffer.
The major issue is memory.
Abundant Memory: If you have lots of memory and small XML documents, you can load the entire document into memory. This way you can traverse a document forwards, backwards, and jump around arbitrarily. This approach is known as Document Object Model (DOM). One better-known implementation of this approach is Apache Xerces. There are other good implementations as well.
Scarce Memory: If you have little memory and large XML documents, then you need to plow through the document from start to finish, biting off small chunks at a time for lower memory usage. This approach is known as SAX. You can find multiple good implementations.
Another issue is validation. Do you want to validate the XML documents against a DTD or Schema? Some tools do this and some do not.
When all you need is to serialize the content of a Java object and read it back, I recommend the Simple XML Serialization library. Much simpler with a quicker learning-curve than the other tools.

Is there a clean way to transform text files that are not the same into a standard format?

I'm pretty sure the answer I'm going to get is: "why don't you just have the text files all be the same or follow some set format". Unfortunately I do not have this option, but I was wondering if there is a way to take any text file and translate it into another text or XML file that will always look the same?
The text files pretty much have the same data just arranged differently.
The closest I can come up with is to have an XSLT sheet for each text file, but then I have to turn around and read the file that was just created, delete it, and repeat for each text file.
So, is there a way to grab the data from text files that essentially have the same data just stored differently, and store this data in an object that I could then re-use later on in some process?
If it were up to me, I would push for every text file to follow some predefined format since they all pretty much contain the same data, but it's not up to me.
Odd question... You say they are text files yet mention XSLT as a possible solution. XSLT will only work if the source is XML, if that is so, please redefine the question. If you say text files I assume delimiter separated (e.g. csv), fixed length,...
There are some parsers (like smooks) out there that allow you to parse multiple formats, but it will still require you to perform the "mapping" yourself of course.
This is a typical problem in the integration world so any integration tool should offer you a solution (e.g. wso2, fuse,...).

Parsing very large XML files and marshalling to Java Objects

I have the following issue: I have very large XML files (300+ MB), and I need to parse them in order to add some of their values to the DB. The structure of these files is also very complex. I want to use a StAX parser, as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, without loading the whole thing into memory. On the other hand, getting the values with StAX (at least on these XML files) is cumbersome; I need to write a ton of code. From this latter point of view it would help me immensely if I could unmarshal the XML file to Java objects (like JAX-B does); however, this would load the whole file plus a ton of object instances into memory all at once.
My question is: is there some way to pull-parse (or just partially parse) the file sequentially, and then unmarshal only those parts to Java objects so I can deal with them easily without bogging down on memory?
I would recommend Eclipse EMF. But it has the same problem: if you give it the file name it will parse the whole thing. There are some options to reduce how much is loaded, but I didn't bother much as we run on machines with 96 GB RAM. :)
Anyway, if your XML format is well defined, then one workaround is to fool EMF by breaking down the whole file into several smaller (but still well-formed) XML snippets, then feeding each snippet in one after the other. I don't know JAX-B, but perhaps the same workaround can be applied there as well. That is what I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
............
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
</tag1>
Then it can be broken down into one XML document per block, each starting with <tag2> and ending with </tag2>. In Java most parsers accept a stream, so just parse using whatever you want, create a StringReader (or similar) for each <tag2> in a loop and pass it to JAX-B or EMF.
HTH
Well, first off I want to thank the two people who answered my question, but I finally ended up not using those suggestions, partly because the proposed technologies are a bit far from, let's say, "standard XML parsing" in Java, and it feels weird going so far when there's a similar tool already present in Java, and partly because I did in fact find a solution that only uses Java APIs to accomplish this.
I will not detail too much the solution I found, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is that if you have an XML document AND its corresponding XSD schema, you can parse and unmarshal it with JAXB, and you can do it in chunks: the chunks can be read with an event parser such as StAX and then passed to the JAXB Unmarshaller.
This practically means that you must first decide where's a good place in your XML file where you can say "this part here has A LOT of repetitive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated a lot inside a parent tag. So all you have to do is make an event listener in your StAX parser that is triggered at the start of each of those child tags, then stream the content of that child tag over to JAXB, unmarshal it with JAXB and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
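The chunk-at-a-time idea can be sketched with StAX alone. To stay within the JDK, the sketch below replaces the JAXB step with a hand-written mapping into a small value class; the element names (item, name), the id attribute and the Item class are all hypothetical. With JAXB on the classpath, the body of readItem would typically be replaced by a call to an Unmarshaller on the same reader.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: streams over <item> children one at a time.
class ChunkedItemReader {

    /** Stand-in for a JAXB-generated class. */
    static final class Item {
        final String id;
        final String name;
        Item(String id, String name) { this.id = id; this.name = name; }
    }

    /** Calls handler once per <item>; only one Item is held in memory at a time. */
    static void forEachItem(Reader xml, Consumer<Item> handler) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(r.getLocalName())) {
                    handler.accept(readItem(r));
                }
            }
        } finally {
            r.close();
        }
    }

    // Reads one <item> subtree; the reader must be positioned on its start tag.
    private static Item readItem(XMLStreamReader r) throws XMLStreamException {
        String id = r.getAttributeValue(null, "id");
        String name = null;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT && "name".equals(r.getLocalName())) {
                name = r.getElementText(); // leaves the reader on </name>
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "item".equals(r.getLocalName())) {
                break; // end of this chunk
            }
        }
        return new Item(id, name);
    }
}
```

The handler is where each chunk would be written to the database before the next one is read.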
Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml

How to write a xml database file efficiently?

I want to build an XML file as a datastore. It should look something like this:
<datastore>
<item>
<subitem></subitem>
...
<subitem></subitem>
</item>
....
<item>
<subitem></subitem>
...
<subitem></subitem>
</item>
</datastore>
At runtime I may need to add items to the datastore. The number of items may be high, so I don't want to hold the whole document in memory and can't use DOM. I just want to write the part where a change occurs. Or does DOM support this?
I had a first look at StAX, but I am not sure if it does what I want.
Wouldn't it be best to remember a cursor position at the end of the file, just before the root element is closed? That is always the position where new items will be added. So if I remember that position and keep it up to date during changes, I could add a new item at the end without iterating through the whole file.
Maybe a second cursor could be used independently of the first one, to iterate over the document just for reading purposes.
I can't see that StAX supports any of this, does it?
Isn't there a block-based API for files instead of a stream-based one? Aren't files and filesystems typical examples of block "devices"? And if there is such an API, does it help me with my problem?
Thanks in advance.
Updating XML is basically impossible because there's no "cheap" way to insert data.
Appending XML is not so bad. All you need to do there is seek to the end of the file, then GO BACK over the "end tag" (</datastore> in this case), and then just start writing. This is a cheap operation all told, but none of the frameworks really support this as they're all mostly designed to work with well formed, full boat XML documents, as a whole, not in pieces.
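The seek-back-and-append step can be sketched with java.io.RandomAccessFile. This is a minimal sketch under a few assumptions: the file is UTF-8, the closing </datastore> tag lies within the last 256 bytes, and the item markup passed in is already well formed. The class and method names are made up.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper that appends an item just before the closing root tag.
class DatastoreAppender {
    private static final String END_TAG = "</datastore>";
    private static final int TAIL_SIZE = 256; // assumption: the end tag is in the last 256 bytes

    static void appendItem(Path file, String itemXml) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long length = raf.length();
            int tailSize = (int) Math.min(length, TAIL_SIZE);
            byte[] tail = new byte[tailSize];
            raf.seek(length - tailSize);
            raf.readFully(tail);

            // ISO-8859-1 maps each byte to one char, so string indexes equal byte offsets.
            int index = new String(tail, StandardCharsets.ISO_8859_1).lastIndexOf(END_TAG);
            if (index < 0) {
                throw new IOException("closing " + END_TAG + " not found near end of file");
            }

            // Go back over the end tag, write the new item, then write the end tag again.
            raf.seek(length - tailSize + index);
            raf.write(itemXml.getBytes(StandardCharsets.UTF_8));
            raf.write('\n');
            raf.write(END_TAG.getBytes(StandardCharsets.UTF_8));
            raf.write('\n');
            raf.setLength(raf.getFilePointer()); // drop anything left over from the old tail
        }
    }
}
```

Only the tail of the file is read and rewritten, so the cost does not grow with the number of items already stored.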
You could use a StAX like thing, but in this case, StAX isn't aware of the <datastore> tag, rather it's just aware of the <item> tags and its elements. Then you create Items and start writing, over and over and over, to the same OutputStream that you have set up.
That's the best way to do this.
But if you need to delete or change data, then you get to rewrite stuff, or do hacks, such as marking elements as "inactive": hunting them down in the XML file, seeking to the 'active="Y"' attribute, and then changing the Y to N in place. It can be done, and it will be mostly efficient, but it's far and away outside what the normal XML processing frameworks let you do. If I were to do that, I'd read the entire file, keep track of those entries and note their locations within it, so later I could easily seek and change them efficiently.
Then when you update something, you "inactivate" the old one and "append" the new one. Eventually you get to GC the file by rewriting it all and throwing out the old, "inactive" entries.
As a rule of thumb, XML files aren't very efficient as datastores, not for the record-based data you seem to want to use them for.
But if you've already got the file and absolutely can't do anything about it, you can use StAX XMLEventReaders and XMLEventWriters to read through a file quickly and insert/modify elements in it.
But when I say quickly, what I mean is more quickly than DOM would be, but nowhere near as effective as any relational DB.
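A rough sketch of that read-and-rewrite approach with XMLEventReader and XMLEventWriter follows. It copies every event from the input to the output and adds one new item just before the closing datastore tag. The class name, method name and the item's plain-text content are assumptions for illustration.

```java
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams a datastore document through, adding one item.
class ItemInserter {
    static void appendItem(Reader in, Writer out, String itemText) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // Assumes "datastore" occurs only as the root element.
            if (event.isEndElement()
                    && "datastore".equals(event.asEndElement().getName().getLocalPart())) {
                writer.add(events.createStartElement("", "", "item"));
                writer.add(events.createCharacters(itemText));
                writer.add(events.createEndElement("", "", "item"));
            }
            writer.add(event); // copy the original event unchanged
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

The whole file is still read and written once per change, which is why this is faster than DOM but far slower than a database update.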
Update: Another option you can consider is vtd-xml, although I haven't tried it in real projects, it actually looks pretty decent.
If you always want to append items at the end, then the best way to handle this is to have two XML files. The outer one, datastore.xml, is simply a wrapper, and looks like this:
<!DOCTYPE datastore [
<!ENTITY e SYSTEM "items.xml">
]>
<datastore>&e;</datastore>
The file items.xml looks like this:
<item>....</item>
<item>....</item>
<item>....</item>
with no wrapper element.
When you want to append data, you can open items.xml and write to the end of it. When you want to read data, open datastore.xml with an XML parser.
Of course, once your data grows beyond 20Mb or so, it may well be better to use an XML database. But I've been using this approach for years for records of Saxon orders, with files that are currently about 8Mb, and it works fine.
It's not very easy or efficient to partially update an XML file, so you won't find much support for it as a use case.
Really it sounds like you need a proper database, perhaps with a tool to export the data as XML.
If you don't want to use a DB and insist on storing the data purely as XML you might consider keeping all your items in memory as objects. Whenever a new one is added you can write all of them out to XML. It might seem inefficient, but depending on your data size might still be good enough.
If you choose this path, you might want to check out the XStream library to make this quite easy; see the XStream tutorial for a quick example.
