I'm doing an XSLT transformation with Saxon on an XML document.
The document is not strictly valid XML, and I want to preserve all <![CDATA[ ... ]]> sections found in it.
However, running the transformation with the .xsl file like this
Transformer trans = TransformerFactory.newInstance().newTransformer(new StreamSource(new File("foo.xsl")));
trans.transform(new StreamSource(new File("foo.xml")), new StreamResult(new File("output.xml")));
results in stripping out these CDATA entries. How can I prevent this?
You can't: the data model used by XSLT does not record whether a piece of text originated from a CDATA section. You can, however, declare in your stylesheet that the text content of certain result elements is to be written out as CDATA sections. This is done with the cdata-section-elements attribute of the xsl:output element in your stylesheet.
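As a minimal sketch of the cdata-section-elements approach, the following runs an identity transform over an in-memory document. The element name note, the sample input, and the class name are made up for illustration; it assumes any JAXP-compliant processor (the JDK's built-in one or Saxon).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CdataOutputDemo {
    // Identity stylesheet; "note" is a hypothetical element name for this sketch.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes' cdata-section-elements='note'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical input: one element using CDATA, one using an entity reference.
    static final String XML =
        "<doc><note><![CDATA[a < b]]></note><other>x &lt; y</other></doc>";

    static String transform() {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform());
    }
}
```

Note that this wraps all text inside every note element in CDATA on output, whether or not the input used CDATA there; elements not listed are serialized with ordinary escaping.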
Consider using Andrew Welch's LexEv tool (bundled I believe with KernowForSaxon), which preprocesses CDATA start and end tags into something different (processing instructions perhaps?) that's visible in the XSLT data model and thus available to the application.
I use TagSoup as the (SAX) XMLReader and set the namespaces feature to false. This parser feeds the Transformer as a SAX source. Complete code:
final TransformerFactory factory = TransformerFactory.newInstance();
final Transformer t = factory.newTransformer(new StreamSource(
getClass().getResourceAsStream("/identity.xsl")));
final XMLReader p = new Parser(); // the tagsoup parser
p.setFeature("http://xml.org/sax/features/namespaces", false);
// getHtml() returns HTML as InputStream
final Source source = new SAXSource(p, new InputSource(getHtml()));
t.transform(source, new StreamResult(System.out));
This results in something like:
< xmlns:html="http://www.w3.org/1999/xhtml">
<>
<>
<>
<>
< height="17" valign="top">
The problem is that the tag names are blank. The XMLReader (the TagSoup parser) reports an empty namespaceURI and an empty local name in the SAX methods ContentHandler#startElement and ContentHandler#endElement. For a parser that is not namespace-aware this is allowed (see the Javadoc).
If I add an XMLFilter which copies the value of the qName to the localName, everything works. However, this is not what I want; I expect this to work "out of the box". What am I doing wrong? Any input would be appreciated!
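For reference, the filter workaround mentioned above could look roughly like this. It is only a sketch built on the standard org.xml.sax.helpers.XMLFilterImpl; the class and helper names are made up.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: falls back to the qName when the parser reports no local name.
final class QNameToLocalNameFilter extends XMLFilterImpl {

    QNameToLocalNameFilter() {
        super();
    }

    QNameToLocalNameFilter(XMLReader parent) {
        super(parent);
    }

    // Returns the local name if present, otherwise the qualified name.
    static String effectiveLocalName(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, effectiveLocalName(localName, qName), qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, effectiveLocalName(localName, qName), qName);
    }
}
```

Since XMLFilterImpl is itself an XMLReader, it would be wired in by passing new QNameToLocalNameFilter(p) to the SAXSource constructor in place of p.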
I expect this to work "out of the box". What am I doing wrong?
What you are doing wrong is taking a technology (XSLT) that is defined to operate over namespace-well-formed XML and attempting to apply it to data that it is not intended to work with. If you want to use XSLT then you must enable namespaces, declare a prefix for the http://www.w3.org/1999/xhtml namespace in your stylesheet, and use that prefix consistently in your XPath expressions.
If your transformer understands XSLT 2.0 (e.g. Saxon 9) then instead of declaring a prefix and prefixing your element names in XPath expressions, you can put xpath-default-namespace="http://www.w3.org/1999/xhtml" on the xsl:stylesheet element to make it treat unprefixed element names as references to that namespace. But in XSLT 1.0 (the default built-in Java Transformer implementation) your only option is to use a prefix.
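To illustrate the XSLT 1.0 route, here is a small sketch that binds the prefix h to the XHTML namespace in the stylesheet and uses it in an XPath expression. The prefix, the sample document, and the class name are arbitrary choices for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XhtmlPrefixDemo {
    // The prefix "h" is bound to the XHTML namespace and used in every XPath step.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:value-of select='/h:html/h:head/h:title'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical namespace-well-formed input, as a namespace-aware parser would deliver it.
    static final String XHTML =
        "<html xmlns='http://www.w3.org/1999/xhtml'>"
      + "<head><title>Hello</title></head><body/></html>";

    static String title() {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(XHTML)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(title());
    }
}
```

With an unprefixed path such as /html/head/title the same stylesheet would select nothing, because unprefixed names in XSLT 1.0 XPath always refer to elements in no namespace.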
I am using JAXB and maven-jaxb2-plugin and I am able right now to bind my schemas to Java code successfully.
I also have a .xsl file "annotate_schemas.xsl" that modifies a specific schema adding some additional information.
Finally, on the schema that I want transformed, I added the header:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="annotate_schemas.xsl"?>
...
The problem is that, while the .xsl is correct (if I open my schema file in a browser, the transformation is done flawlessly), JAXB ignores it and binds an untouched version of my schema.
My question is: does JAXB (and/or its plugin) have an XSLT processor? Is there a way to tell JAXB to bind the result of the XSLT transformation instead of the original?
Thank you very much
JAXB, like the vast majority of XML-consuming applications, takes no notice of an <?xml-stylesheet?> processing instruction. If you want to transform a document before passing it to JAXB, you need to transform it explicitly, for example by using the JAXP transformation API. (JAXP can also locate the stylesheet named by the xml-stylesheet PI, if that's how you want to control it: TransformerFactory.getAssociatedStylesheet().)
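A rough sketch of the explicit route: the schema is run through the stylesheet in code, and the result is what the binding step would then read. The stylesheet below is a stand-in for annotate_schemas.xsl (it just adds an id attribute to xs:schema), and the schema content and class name are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SchemaPreprocessDemo {
    // Stand-in for annotate_schemas.xsl: identity transform plus an extra attribute.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='xs:schema'>"
      + "<xsl:copy><xsl:apply-templates select='@*'/>"
      + "<xsl:attribute name='id'>annotated</xsl:attribute>"
      + "<xsl:apply-templates select='node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical schema; in practice this would be read from the .xsd file.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='item' type='xs:string'/>"
      + "</xs:schema>";

    // Returns the transformed schema text that the binding step should consume.
    static String annotate() {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(SCHEMA)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(annotate());
    }
}
```

In a Maven build the same idea would presumably mean running the transformation in an earlier phase and pointing the binding plugin at the generated schema rather than the original.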
You can try something like this:
TransformerFactory transFact = TransformerFactory.newInstance();
Templates displayTemplate = transFact.newTemplates(new StreamSource(new File("your_xsl_file")));
TransformerHandler handler =
((SAXTransformerFactory) transFact).newTransformerHandler(displayTemplate);
Mostly continued from this question: XSLT: CSV (or Flat File, or Plain Text) to XML
So, I have an XSLT from here: http://andrewjwelch.com/code/xslt/csv/csv-to-xml_v2.html
And it converts a CSV file to an XML document. It does this when used with the following command on the command line:
java -jar saxon9he.jar -xsl:csv-to-xml.xsl -it:main -o:output.xml
So now the question becomes: how do I do this in my Java code?
Right now I have code that looks like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
StreamSource xsltSource = new StreamSource(new File("location/of/csv-to-xml.xsl"));
Transformer transformer = transformerFactory.newTransformer(xsltSource);
StringWriter stringWriter = new StringWriter();
transformer.transform(documentSource, new StreamResult(stringWriter));
String transformedDocument = stringWriter.toString().trim();
(The Transformer is an instance of net.sf.saxon.Controller.)
The trick on the command line is to specify "-it:main" to point right at the named template in the XSLT. This means you don't have to provide the source file with the "-s" flag.
The problem starts again on the Java side. Where and how would I specify this "-it:main"? Wouldn't doing so break other XSLTs that don't need it? Would I have to name a template "main" in every XSLT file? Given the method signature of Transformer.transform(), I have to specify the source file, so doesn't that defeat all the progress I've made in figuring this out?
Edit: I found the s9api hidden inside the saxon9he.jar, if anyone is looking for it.
You are using the JAXP API, which was designed for XSLT 1.0. If you want to make use of XSLT 2.0 features, like the ability to start a transformation at a named template, I would recommend using the s9api interface instead, which is much better designed for this purpose.
However, if you've got a lot of existing JAXP code and you don't want to rewrite it, you can usually achieve what you want by downcasting the JAXP objects to the underlying Saxon implementation classes. For example, you can cast the JAXP Transformer as net.sf.saxon.Controller, and that gives you access to controller.setInitialTemplate(); when it comes to calling the transform() method, just supply null as the Source parameter.
Incidentally, if you're writing code that requires a 2.0 processor then I wouldn't use TransformerFactory.newInstance(), which will give you any old XSLT processor that it finds on the classpath. Use new net.sf.saxon.TransformerFactoryImpl() instead, which (a) is more robust, and (b) much much faster.
I'm trying to figure out a way to extract specific information (name, description, id, etc.) from an HTML file, leaving behind the unwanted information, and store it in an XML file.
I thought of using XSLT, since it can do XML to HTML, but it doesn't seem to work the other way around.
I honestly don't know what other language I should try for this. I know basic Java and JavaScript, but I'm not sure whether they can do it. I'm kind of lost on getting started.
I'm open to any advice or help, and willing to learn a new language too, as I'm just doing this for fun.
There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.
Here's an example that shows parsing HTML using HTMLCleaner and then converting that object into a standard org.w3c.dom.Document:
TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);
In Jsoup, simply parse the input and serialize it into a string:
String text = Jsoup.parse("<html><div><p>test").outerHtml();
And convert that string into a W3C Document using one of the methods described here:
How to parse a String containing XML in Java and retrieve the value of the root node?
You can now use the standard JAXP interfaces to transform this document:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
Note: Provide some XSLT source to tFact.newTransformer() to do something more useful than the identity transform.
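As a sketch of that last step, here is a transform that pulls a name and description out of an already-cleaned document and writes them as XML. The markup (div class="item" with h2 and p children) and the output element names are invented for the example, and the DOM is built with the JDK parser purely as a stand-in for the document produced by the cleaning library.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class HtmlToXmlDemo {
    // Hypothetical stylesheet: keeps only name and description of each item.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<items><xsl:for-each select=\"//div[@class='item']\">"
      + "<item><name><xsl:value-of select='h2'/></name>"
      + "<description><xsl:value-of select='p'/></description></item>"
      + "</xsl:for-each></items>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical cleaned-up HTML; the "ad" block is the unwanted part.
    static final String HTML =
        "<html><body>"
      + "<div class='item'><h2>Widget</h2><p>A small part</p></div>"
      + "<div class='ad'>Buy now</div>"
      + "</body></html>";

    static String extract() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // XSLT expects a namespace-aware DOM
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(HTML)));

            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract());
    }
}
```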
I would use HTMLAgilityPack or Chris Lovett's SGMLReader.
Or, simply HTML Tidy.
Ideally, you can treat your HTML as XML. If you're lucky, it will already be XHTML, and you can process it as XML. If not, use something like http://nekohtml.sourceforge.net/ (an HTML tag balancer, etc.) to process the HTML into something that is XML compliant so that you can use XSLT.
I have a specific example and some notes around doing this on my personal blog at http://blogger.ziesemer.com/2008/03/scraping-suns-bug-database.html.
TagSoup
JSoup
Beautiful Soup
I apologize for the elementary question. I have an XML file, as well as an XSL to translate it into another format (KML). Within the KML I wish to inject a dynamic attribute which is not present in the original XML document. I want to emit a node like the following:
<NetworkLinkControl>
<message>This is a pop-up message. You will only see this once</message>
<cookie>sessionID={#sessionID}</cookie>
<minRefreshPeriod>5</minRefreshPeriod>
</NetworkLinkControl>
In particular I want the {#sessionID} text to be replaced with a dynamic value that I insert into the template somehow (i.e. is NOT part of the source XML document that the XSLT is transforming).
Here's the code I'm using to marshal the KML:
DomainObject myObject = ...;
JAXBContext context = JAXBContext.newInstance(new Class[]{DomainObject.class});
Marshaller xmlMarshaller = context.createMarshaller();
xmlMarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
TransformerFactory transFact = TransformerFactory.newInstance();
// converts from jaxb XML representation into KML
Templates displayTemplate = transFact.newTemplates(new StreamSource(new File("conf/jaxbkml.xsl")));
Result outputResult = new StreamResult(System.out);
TransformerHandler handler =
((SAXTransformerFactory) transFact).newTransformerHandler(displayTemplate);
handler.setResult(outputResult);
Transformer transformer = handler.getTransformer();
// TODO: what do I actually fill in here to ensure that the session ID comes through
// in the XSLT document? I can't make heads or tails of the javadocs
transformer.setOutputProperty("{http://xyz.foo.com/yada/baz.html}sessionID", "asdf");
xmlMarshaller.marshal(myObject, handler);
I have gathered that there is a way to substitute in values dynamically in the XSLT via Attribute Value Templates and I assume that there is a way to hookup the transformer's properties to be used with these Attribute Value Templates, but I don't quite see how it's done. Could someone shed some light? Thanks.
Thanks to @jtahlborn for setting me on the right track. It is possible to do this, but I wasn't putting all the pieces together. First, define an xsl:param.
<!-- give it a default value if none is set -->
<xsl:param name="sessionID" select="''"/>
Second, insert a reference to this xsl:param. If you need to embed it within the content of a node, as I did, use an xsl:value-of node.
<cookie>sessionID=<xsl:value-of
select="$sessionID"/></cookie>
Otherwise, if you need to embed it within an attribute value, use an attribute value template:
<img src="{$sessionID}/sample.gif"/>
Next, pass in a value for that xsl:param from within Java.
Result outputResult = new StreamResult(outputStream);
TransformerHandler handler =
((SAXTransformerFactory) transFact).newTransformerHandler(displayTemplate);
Transformer transformer = handler.getTransformer();
// Here is where the parameter is bound.
transformer.setParameter("sessionID", sessionID);
handler.setResult(outputResult);
xmlMarshaller.marshal(listWrapper, handler);
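Putting the pieces together, a self-contained sketch of the same idea might look like this. It uses a plain Transformer and an inline stylesheet instead of the TransformerHandler and JAXB marshaller from the question; the dummy input document and the class name are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SessionParamDemo {
    // Global xsl:param with an empty default, referenced via xsl:value-of.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:param name='sessionID' select=\"''\"/>"
      + "<xsl:template match='/'>"
      + "<NetworkLinkControl>"
      + "<cookie>sessionID=<xsl:value-of select='$sessionID'/></cookie>"
      + "</NetworkLinkControl>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String render(String sessionId) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            // Bind the stylesheet parameter; the name must match xsl:param/@name.
            t.setParameter("sessionID", sessionId);
            StringWriter out = new StringWriter();
            // "<root/>" is a dummy source document for this sketch.
            t.transform(new StreamSource(new StringReader("<root/>")), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("abc123"));
    }
}
```

The parameter has to be set before transform() is called (or, with a TransformerHandler, before the marshaller starts feeding it events), otherwise the default value from the stylesheet is used.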
The attribute value templates are part of your XSL, not part of your XML, so what you are attempting won't work. You could use xpath to select the element which matches the pattern "sessionID={#sessionID}" and replace that with the text of your choice.
I believe you can set parameters for the stylesheet using the Transformer.setParameter() method; they can then be referenced in the stylesheet as $param in XPath expressions, or as "{$param}" inside an attribute value template.