Transforming XML with Java

I was learning how to convert an XML file into HTML using just Java; later I decided to learn how to do the same with the XSLT language.
By saying "just Java", I mean using only the syntax of the Java language, that is, not XSLT.
To clarify:
Loading XML into a DOM (using a DocumentBuilder).
Parsing it (just doing things like doc.getFirstChild()).
Writing it to an HTML file (just using a character stream, not an XML serializer).
What happened?
After I include the following line in my XML:
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
My Java application no longer wrote the HTML correctly.
If I remove that line, everything works, but I want to keep it.
Any ideas on how to ignore this "instruction"?

XSLT will ignore processing instructions (that is, remove them) by default. If you want to retain this one, just add a template rule to do so:
<xsl:template match="processing-instruction('xml-stylesheet')">
<xsl:copy/>
</xsl:template>
This assumes that your stylesheet is written in the classic recursive-descent style using apply-templates; if you're self-taught in XSLT then you might not have yet learnt this style. As always, it's much easier to help people when they show us the code.

It depends on how you are reading the XML from your Java application. But if your XML has an embedded Processing Instruction like
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
then it means that the stylesheet is an integral part of the data, and must be applied to the XML for it to be of any use. This is very similar to a CSS stylesheet processing instruction like, for example
<?xml-stylesheet type="text/css" href="standard.css"?>
which, in the same way, is an integral part of the XHTML, just as if it was an internal style within <style> tags.
It is clearly possible to read and use the XML without applying the stylesheet, but that is to ignore the directive of the data itself.
If you want to treat the XML as raw data and apply an optional transform to it in different ways then you must omit the processing instruction from the XML.

Sorry guys, I thought that the XML with the mystylesheet.xsl reference was being "transformed" into the DOM object that I was using to parse the XML.
I assumed that:
The XML was being transformed before being put into the DOM.
The <?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?> line was invisible in the DOM.
Basically I had a simple XML to start learning how to do the transformation. Something like the following:
<?xml version="1.0" encoding="UTF-8"?>
<items><item>...</item></items>
For simplicity (I was learning...) I decided to start my parsing with:
parse(doc.getFirstChild().getFirstChild()); //Expecting the first "item".
But after introducing the stylesheet to the XML the document became:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>
<items><item>...</item></items>
Because of this addition, doc.getFirstChild().getFirstChild() no longer pointed at an <item>.
Then I realized that I had forgotten to skip the node with this instruction (I really thought it was "invisible" in the DOM tree).
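In case it helps anyone else, a minimal sketch of the fix (the file name is a placeholder): either ask the Document for its document element, which skips the processing-instruction node, or filter children by node type.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse("items.xml");            // placeholder file name

// getDocumentElement() returns <items> directly, skipping the
// <?xml-stylesheet ...?> processing-instruction node.
Element items = doc.getDocumentElement();

// Alternatively, walk the children and only handle element nodes.
for (Node child = items.getFirstChild(); child != null; child = child.getNextSibling()) {
    if (child.getNodeType() == Node.ELEMENT_NODE) {
        // each <item>
    }
}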
Learning guys, learning...
P.S. That was my first attempt to transform XML with XSLT!
Thank you for your help.

Related

How to invoke XML <?xml-stylesheet ...?> directives in Java?

I have an XML file which references an associated XSL file, like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="my-transform.xsl"?>
<my-root> .....
and I want to read it in as a org.w3c.dom.Document, applying the transform.
I'm considering reading it in, extracting the stylesheet processing-instruction using the XPath /processing-instruction('xml-stylesheet') and then loading the XSL file by hand and applying it with a Transformer.
But it seems odd that I need to do this manually - is there a neat way to read the file and apply the embedded transform automatically?
UPDATE: thanks to #raphaëλ for observing that TransformerFactory.getAssociatedStylesheet(...) will identify the xml-stylesheet value as a Source, which is pretty close. Is there anything more automatic than that?
Ok, nobody else answered, and I know the answer now. Stylesheets are not applied automatically. But you can get hold of the stylesheet using TransformerFactory.getAssociatedStylesheet(...), which will identify the xml-stylesheet value as a Source. You can then apply it manually.
Thanks to raphaëλ for pointing this out.
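For completeness, a rough sketch of that manual step (the input file name is a placeholder): getAssociatedStylesheet() resolves the PI to a Source, and a Transformer applies it into a DOMResult.
import java.io.File;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

TransformerFactory factory = TransformerFactory.newInstance();
// Resolve the <?xml-stylesheet ...?> PI to a Source (media/title/charset left null).
Source xsl = factory.getAssociatedStylesheet(
        new StreamSource(new File("input.xml")), null, null, null);
// Apply it; the transformed result ends up as a DOM Document.
DOMResult result = new DOMResult();
factory.newTransformer(xsl)
        .transform(new StreamSource(new File("input.xml")), result);
Document transformed = (Document) result.getNode();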

Stax does not read characters like "“" [duplicate]

I am trying to parse an XML document using a SAX parser, but keep getting "XML document structures must start and end within the same entity", which is expected, as the XML doc I get from the other source won't be a proper one. But I don't want this exception to be raised, because I would like to parse the document only until I find the <myTag> in it, and I don't care whether the doc has proper starting and closing entities.
Example:
<employeeDetails>
<firstName>xyz</firsName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>...
Here I don't care whether the document is a valid one or not, as that part is not in my hands. I would like to parse this document only until I see <departName>; after that I don't want to parse any further. Please suggest how to do this. Thanks.
You cannot use an XML parser to parse a file that does not contain well-formed XML. (It does not have to be valid, just well-formed. For the difference, read Well-formed vs Valid XML.)
By definition, XML must be well-formed, otherwise it is not XML. Parsers in general have to have some fundamental constraints met in order to operate, and for XML parsers, it is well-formedness.
Either repair the file manually first to be well-formed XML, or open it programmatically and parse it as a text file using traditional parsing techniques. An XML parser cannot help you unless you have well-formed XML.
BeautifulSoup in Python can handle incomplete XML really well.
I use it to parse the prefix of large XML files for previews.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a><b>foo</b><b>bar<','xml')
<?xml version="1.0" encoding="unicode-escape"?>\n<a><b>foo</b><b>bar</b></a>
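On the Java side, a pull parser can give a similar "prefix only" effect, assuming the document is well-formed at least up to the element you need (with the caveat that a parser may read ahead a little). A rough StAX sketch, using the element names from the question and a placeholder file name:
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

XMLStreamReader reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream("employees.xml")); // placeholder file name
String departName = null;
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT
            && "departName".equals(reader.getLocalName())) {
        departName = reader.getElementText();   // "SALES" in the example
        break;                                  // stop before the broken tail is ever reached
    }
}
reader.close();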

Is it possible to parse multiple XMLs separately from a single HTTP post request with DOM parser (in Java)?

I am making an XML parser that needs to take multiple XML files in one go and parse them as individual files all from the same HTTP post.
Their layout is roughly the same - here's an example of two of the XMLs together:
<?xml version="1.0" encoding="UTF-8"?>
<report>
...
</report>
<report>
...
</report>
I was thinking maybe there is some way for Java to find each instance of </report> and ultimately split the full payload into multiple documents, parsing each one separately. Is this possible using DOM, and if not, is there any way to do it in Java?
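One common workaround (a sketch, not a definitive answer): since the payload as a whole is not well-formed XML, strip the repeated XML declarations and wrap everything in a synthetic root, then parse once with DOM and treat each <report> element separately. readRequestBody() is a hypothetical placeholder for however you obtain the POST body as a String.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

String payload = readRequestBody();                      // hypothetical: the raw POST body as a String
String wrapped = "<batch>"                               // synthetic root around all the reports
        + payload.replaceAll("<\\?xml[^>]*\\?>", "")     // drop the repeated XML declarations
        + "</batch>";
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(wrapped)));
NodeList reports = doc.getElementsByTagName("report");
for (int i = 0; i < reports.getLength(); i++) {
    // handle reports.item(i) as if it were its own document
}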

Parsing a false XML using JAXB

I have a situation where the XML (but it's not really XML data, rather a tag-based custom data format) is sent from a third-party server (because of that I can't change the format, and coordinating with the third party is pretty difficult). The markup looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<result>SUCCESS</result>
<req>
<?xml version="1.0" encoding="UTF-8"?>
<Secure>
<Message id="dfgdfdkjfghldkjfgh88934589345">
<VEReq>
<version>1.0.2</version><pan>3453243453453</pan>
<Merchant><acqBIN>433274</acqBIN>
<merID>3453453245</merID>
<password>342534534</password>
</Merchant>
<Browser></Browser>
</VEReq>
</Message>
</Secure>
</req>
<id>1906547421350020</id>
<trackid>f68fb35c-cbc2-468b-aaf8-7b3f399b709d</trackid>
<ci>6</ci>
Now here I want only the result, req, id, trackid and ci tag values as the parse output. That means after parsing I need req to contain everything inside its tags. One more point: the req tag has another XML document embedded in it as-is, not as CDATA. I can't parse it using JAXB.
Does somebody have a library that can parse all the content if I configure the available tags in a file, or any other way? I don't really need to convert it to an object; even a hashmap with the tag as key and the content as value is fine, though I'd prefer the POJO model (generating a class from this kind of XML).
Let me know if somebody can help me.
Make it well-formed XML first and then pass it to whatever tool you find suitable. JAXB is not bad, as it will ignore elements it does not know (apart from the root element).
And since most (if not all) tools expect well-formed XML anyway, you'll have to take care of turning your "false" XML into "true" XML first. I'd first try something like JTidy or JSoup and see if they help to make your non-well-formed XML well-formed.
If it does not work I'd try to hack it on the lower-level SAX or StAX parsing. The XML you posted seems to suffer from two problems: no single root element and XML declaration in the body. I think both problems can be addressed with some minimal parser hacking.
And I think there is a special place in hell for people who invent this type of non-well-formed XML. Damned to sit there and correct all the HTML documents on the Internet into valid XHTML by hand.
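Along those lines, a minimal preprocessing sketch (assuming the payload always looks like the sample above; readResponseBody() is a hypothetical placeholder): drop the stray XML declarations, wrap everything in one root, then read the top-level tags into a map.
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

String raw = readResponseBody();                        // hypothetical: the third-party payload as a String
String fixed = "<response>"                             // add a single root element
        + raw.replaceAll("<\\?xml[^>]*\\?>", "")        // drop the stray XML declarations
        + "</response>";
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(fixed)));

Map<String, String> values = new LinkedHashMap<>();
NodeList children = doc.getDocumentElement().getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
    Node n = children.item(i);
    if (n.getNodeType() == Node.ELEMENT_NODE) {
        // <req> ends up as its full text content here; keep the Node itself
        // (or re-serialize it) if you need the embedded markup as well.
        values.put(n.getNodeName(), n.getTextContent());
    }
}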

How to transform huge XML files in Java?

As the title says, I have a huge XML file (GBs):
<root>
<keep>
<stuff> ... </stuff>
<morestuff> ... </morestuff>
</keep>
<discard>
<stuff> ... </stuff>
<morestuff> ... </morestuff>
</discard>
</root>
and I'd like to transform it into a much smaller one which retains only a few of the elements.
My parser should do the following:
1. Parse through the file until a relevant element starts.
2. Copy the whole relevant element (with its children) to the output file. Go to 1.
Step 1 is easy with SAX and impossible for DOM parsers.
Step 2 is annoying with SAX, but easy with a DOM parser or XSLT.
So: is there a neat way to combine SAX and a DOM parser to do the task?
StAX would seem to be one obvious solution: it's a pull parser rather than either the "push" of SAX or the "buffer the whole thing" approach of DOM. Can't say I've used it though. A "StAX tutorial" search may come in handy :)
Yes, just write a SAX content handler, and when it encounters the relevant element, build a DOM tree from that element. I've done this with very large files, and it works very well.
It's actually very easy: As soon as you encounter the start of the element you want, you set a flag in your content handler, and from there on, you forward everything to the DOM builder. When you encounter the end of the element, you set the flag to false, and write out the result.
(For more complex cases with nested elements of the same element name, you'll need to create a stack or a counter, but that's still quite easy to do.)
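A rough sketch of that flag-plus-DOM-builder idea (element and file names are placeholders; error handling trimmed):
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SubtreeCopier extends DefaultHandler {
    private final Document dom;   // holds only the copied subtrees
    private Node current;         // null while we are outside a <keep> subtree
    private int depth;

    public SubtreeCopier() throws Exception {
        dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        dom.appendChild(dom.createElement("result"));   // new, much smaller root
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (current == null) {
            if (!"keep".equals(qName)) return;           // still skipping
            current = dom.getDocumentElement();          // entering a <keep> subtree
        }
        Element e = dom.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(e);
        current = e;
        depth++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(dom.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) return;
        current = current.getParentNode();
        if (--depth == 0) current = null;                // left the <keep> subtree
    }

    @Override
    public void endDocument() throws SAXException {
        try {                                            // write the shrunken document out
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(dom), new StreamResult("small.xml"));
        } catch (Exception e) {
            throw new SAXException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new java.io.File("huge.xml"), new SubtreeCopier());
    }
}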
I have had good experiences with STX (Streaming Transformations for XML). Basically, it is a streamed version of XSLT, well suited to parsing huge amounts of data with a minimal memory footprint. It has a Java implementation named Joost.
It should be easy to come up with a STX transform that ignores all elements until the element matches a given XPath, copies that element and all its children (using an identity template within a template group), and continues to ignore elements until the next match.
UPDATE
I hacked together a STX transform that does what I understand you want. It mostly depends on STX-only features like template groups and configurable default templates.
<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
version="1.0" pass-through="none" output-method="xml">
<stx:template match="element/child">
<stx:process-self group="copy" />
</stx:template>
<stx:group name="copy" pass-through="all">
</stx:group>
</stx:transform>
The pass-through="none" on the stx:transform element configures the default templates (for nodes, attributes etc.) to produce no output but still process child elements. When the stx:template matches the XPath element/child (this is where you put your own match expression), it "processes self" in the "copy" group, meaning that the matching template from the group named "copy" is invoked on the current element. That group has pass-through="all", so its default templates copy their input and process child elements. When the element/child element ends, control passes back to the template that invoked process-self, and the following elements are ignored again, until the template matches the next time.
The following is an example input file:
<root>
<child attribute="no-parent, so no copy">
</child>
<element id="id1">
<child attribute="value1">
text1<b>bold</b>
</child>
</element>
<element id="id2">
<child attribute="value2">
text2
<x:childX xmlns:x="http://x.example.com/x">
<!-- comment -->
yet more<b i="i" x:i="x-i" ></b>
</x:childX>
</child>
</element>
</root>
This is the corresponding output file:
<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1">
text1<b>bold</b>
</child><child attribute="value2">
text2
<x:childX xmlns:x="http://x.example.com/x">
<!-- comment -->
yet more<b i="i" x:i="x-i" />
</x:childX>
</child>
The unusual formatting is a result of skipping the text nodes containing newlines outside the child elements.
Since you're talking about GBs, I would prioritize memory usage. SAX needs roughly twice as much memory as the document size, while DOM needs at least five times as much. So if your XML file is 1GB, then DOM would require a minimum of 5GB of free memory. That's not funny anymore. So SAX (or any variant of it, like StAX) is the best option here.
If you want the most memory-efficient approach, look at VTD-XML. It requires only a little more memory than the file itself.
Have a look at StAX, this might be what you need. There's a good introduction on IBM Developer Works.
For such a large XML document something with a streaming architecture, like Omnimark, would be ideal.
It wouldn't have to be anything complex either. An Omnimark script like what's below could give you what you need:
process
submit #main-input
macro upto (arg string) is
((lookahead not string) any)*
macro-end
find (("<keep") upto ("</keep>") "</keep>")=>keep
output keep
find any
You can do this quite easily with an XMLEventReader and several XMLEventWriters from the javax.xml.stream package.
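For example, a sketch along those lines (file names are placeholders; note that the raw output will contain one top-level element per <keep> subtree unless you also write a wrapping root):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

XMLEventReader in = XMLInputFactory.newInstance()
        .createXMLEventReader(new FileInputStream("huge.xml"));
XMLEventWriter out = XMLOutputFactory.newInstance()
        .createXMLEventWriter(new FileOutputStream("small.xml"));
int depth = 0;                                    // > 0 while inside a <keep> subtree
while (in.hasNext()) {
    XMLEvent event = in.nextEvent();
    if (event.isStartElement()
            && "keep".equals(event.asStartElement().getName().getLocalPart())) {
        depth++;
    }
    if (depth > 0) {
        out.add(event);                           // forward only events inside <keep>
    }
    if (event.isEndElement()
            && "keep".equals(event.asEndElement().getName().getLocalPart())) {
        depth--;
    }
}
out.close();
in.close();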
