Saxon in Java: XSLT for CSV to XML


Mostly continued from this question: XSLT: CSV (or Flat File, or Plain Text) to XML
So, I have an XSLT from here: http://andrewjwelch.com/code/xslt/csv/csv-to-xml_v2.html
And it converts a CSV file to an XML document. It does this when used with the following command on the command line:
java -jar saxon9he.jar -xsl:csv-to-xml.xsl -it:main -o:output.xml
So now the question becomes: how do I do this in my Java code?
Right now I have code that looks like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
StreamSource xsltSource = new StreamSource(new File("location/of/csv-to-xml.xsl"));
Transformer transformer = transformerFactory.newTransformer(xsltSource);
StringWriter stringWriter = new StringWriter();
transformer.transform(documentSource, new StreamResult(stringWriter));
String transformedDocument = stringWriter.toString().trim();
(The Transformer is an instance of net.sf.saxon.Controller.)
The trick on the command line is to specify "-it:main" to point right at the named template in the XSLT. This means you don't have to provide the source file with the "-s" flag.
The problem starts again on the Java side. Where and how would I specify this "-it:main"? Wouldn't doing so break other XSLTs that don't need it specified? Would I have to name a template "main" in every XSLT file? And given the method signature of Transformer.transform(), I have to specify the source file, so doesn't that defeat all the progress I've made in figuring this thing out?
Edit: I found the s9api hidden inside the saxon9he.jar, if anyone is looking for it.

You are using the JAXP API, which was designed for XSLT 1.0. If you want to make use of XSLT 2.0 features, like the ability to start a transformation at a named template, I would recommend using the s9api interface instead, which is much better designed for this purpose.
However, if you've got a lot of existing JAXP code and you don't want to rewrite it, you can usually achieve what you want by downcasting the JAXP objects to the underlying Saxon implementation classes. For example, you can cast the JAXP Transformer as net.sf.saxon.Controller, and that gives you access to controller.setInitialTemplate(); when it comes to calling the transform() method, just supply null as the Source parameter.
Incidentally, if you're writing code that requires a 2.0 processor then I wouldn't use TransformerFactory.newInstance(), which will give you any old XSLT processor that it finds on the classpath. Use new net.sf.saxon.TransformerFactoryImpl() instead, which (a) is more robust, and (b) much much faster.

Related

Converting java xml sax event calls to an xml string

Does the Java XML SAX API provide a ContentHandler subclass that converts the event calls to an XML string? For example, the following calls to this handler should produce the following XML:
XMLPrinterHandler h;
String data = "hello";
h.startDocument();
h.startElement("", "element", "element", new AttributesImpl());
h.characters(data.toCharArray(), 0, data.length());
h.endElement("", "element", "element");
h.endDocument();
System.out.println(h.getXml());
This should print:
<element>hello</element>
I'm dealing with some code which encodes some data as xml and would like to know the intermediate output. The encoding class takes a ContentHandler and calls the appropriate methods on it to encode the data.

You want:
SAXTransformerFactory f = (SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler t = f.newTransformerHandler();
t.setResult(new StreamResult(System.out));
t.startDocument();
etc.
The TransformerHandler performs a "null transformation" from SAX input to lexical XML output.
You can also use
t.getTransformer().setOutputProperty()
to set serialization properties such as indenting, based on the properties defined in the XSLT specification. (The standard JDK TransformerHandler gives you XSLT 1.0 serialization properties; if you want the extended set defined in XSLT 3.0 plus Saxon extensions, use the Saxon implementation.)
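As a rough sketch of the TransformerHandler approach, using only the JAXP implementation bundled with the JDK: the class name SaxToXmlString and the helper render are invented for illustration, and the output is captured in a StringWriter rather than sent to System.out.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class SaxToXmlString {
    // Hypothetical helper: feeds SAX events to a TransformerHandler
    // and returns the serialized XML.
    static String render(String text) throws Exception {
        // The JDK's default TransformerFactory is also a SAXTransformerFactory.
        SAXTransformerFactory f =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler h = f.newTransformerHandler();
        // Serialization properties are set on the underlying Transformer.
        h.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        h.setResult(new StreamResult(out));

        // TransformerHandler is a ContentHandler, so it accepts the events directly.
        h.startDocument();
        h.startElement("", "element", "element", new AttributesImpl());
        char[] chars = text.toCharArray();
        h.characters(chars, 0, chars.length);
        h.endElement("", "element", "element");
        h.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("hello"));
    }
}
```

Code that expects a plain ContentHandler can be handed the TransformerHandler as-is, since it implements that interface.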
Personally I find that writing Java code as a direct client of the SAX ContentHandler interface is very clumsy. I much prefer the XMLStreamWriter interface.
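For comparison, here is a minimal sketch of the same output produced through XMLStreamWriter. The class and method names are made up for the example, and it assumes the StAX implementation that ships with the JDK.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

class StaxWriterSketch {
    // Hypothetical helper: writes a single element with text content.
    static String render(String text) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        // No writeStartDocument() call, so no XML declaration is emitted.
        w.writeStartElement("element");
        w.writeCharacters(text);   // special characters such as '<' are escaped
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("hello"));
    }
}
```

There are no Attributes objects or char[] offsets to manage here, which is most of what makes the ContentHandler style feel clumsy.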

Document to String using DocumentBuilderFactory?

I am trying to find a way to convert Document to String and found this XML Document to String? post here. But, I want to do the conversion without using TransformerFactory because of XXE Vulnerabilities and by using DocumentBuilderFactory only. I cannot upgrade to jdk8 because of other limitations.
I haven't had any luck so far with it; all the searches are returning the same code shown in the above link.
Is it possible to do this?

This is difficult to do, but since your actual problem is the security vulnerability and not TransformerFactory itself, addressing the vulnerability directly may be a better way to go.
You should be able to configure TransformerFactory to ignore entities to prevent this sort of problem. See: Preventing XXE Injection
Another thing that may work for your security concerns is to use TransformerFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true). This should prevent the problems that you're worried about. See also this forum thread on coderanch.
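One way to picture the split between parsing and serializing is sketched below. It assumes the Document originally came from untrusted text, that the parser is the Xerces-based one bundled with the JDK (which recognizes the disallow-doctype-decl feature), and the class and method names are hypothetical. Serializing from a DOMSource involves no parsing, so the XXE exposure sits in the parse step, and that step is controlled through DocumentBuilderFactory.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SafeDomToString {
    // Parses XML text while rejecting any DOCTYPE declaration, which rules
    // out external entities. The feature URI is Xerces-specific (assumption:
    // the default JDK parser is in use).
    static Document parseWithoutDoctype(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Serializes an already-built DOM; the identity transform reads the
    // in-memory tree and does not parse any input.
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(parseWithoutDoctype("<a><b>x</b></a>")));
    }
}
```

With this configuration, input that carries a DOCTYPE is rejected with a parse exception instead of being expanded.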

Setting FEATURE_SECURE_PROCESSING may or may not help, depending on which implementation TransformerFactory.newInstance() actually returns.
For example, in Java 7 with no additional XML libraries on the classpath, setting transformerFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true); does not help.
You can fix this by providing a Source other than StreamSource (which the factory would have to parse using settings that you do not control).
For example you can use StAXSource like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
transformerFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true); // does not help in Java 7
Transformer transformer = transformerFactory.newTransformer();
// StreamSource is insecure by default:
// Source source = new StreamSource(new StringReader(xxeXml));
// Source configured to be secure:
XMLInputFactory xif = XMLInputFactory.newFactory();
xif.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
xif.setProperty(XMLInputFactory.SUPPORT_DTD, false);
XMLEventReader xmlEventReader = xif.createXMLEventReader(new StringReader(xxeXml));
Source source = new StAXSource(xmlEventReader);
transformer.transform(
source,
new StreamResult(new ByteArrayOutputStream()));
Note that the actual TransformerFactory may not support StAXSource, so you need to test your code with the classpath as it will be in production. For example, Saxon 9 (an old release, I know) does not support StAXSource, and the only clean way of "fixing" it that I know of is to provide a custom net.sf.saxon.Configuration instance.

JAXB and XSLT processor

I am using JAXB and maven-jaxb2-plugin and I am able right now to bind my schemas to Java code successfully.
I also have a .xsl file "annotate_schemas.xsl" that modifies a specific schema adding some additional information.
Finally, on the schema that I want transformed, I added the header:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="annotate_schemas.xsl"?>
...
The problem is that, while the .xsl is correct (if I open my schema file in a browser, the transformation is done flawlessly), JAXB ignores it and binds an untouched version of my schema.
My question is: does JAXB (and/or its plugin) have an XSLT processor? Is there a way to tell JAXB to bind the result of the XSLT transformation instead of the original?
Thank you very much

JAXB, like the vast majority of XML-consuming applications, takes no notice of an <?xml-stylesheet?> processing instruction. If you want to transform a document before passing it to JAXB, you need to transform it explicitly, for example by using the JAXP transformation API. (There is an option in JAXP to request transformation according to the value of the xml-stylesheet PI, if that's how you want to control it: TransformerFactory.getAssociatedStylesheet().)
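A minimal sketch of such an explicit transformation step follows, using the JDK's built-in XSLT 1.0 processor. The stylesheet and the schema are inlined as small stand-in strings, and the class and method names are hypothetical. In a real build the result would more likely be written to a file that the binding step then reads.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SchemaPreTransform {
    // Stand-in for annotate_schemas.xsl: copies everything and adds one
    // attribute to the root element.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='/*'>"
      + "<xsl:copy><xsl:attribute name='annotated'>true</xsl:attribute>"
      + "<xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet explicitly; nothing here looks at an
    // xml-stylesheet processing instruction.
    static String apply(String xslt, String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the schema document that should be bound afterwards.
        System.out.println(apply(XSLT, "<schema name='s'><item/></schema>"));
    }
}
```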

You can try something like this:
TransformerFactory transFact = TransformerFactory.newInstance();
Templates displayTemplate = transFact.newTemplates(new StreamSource(new File("your_xsl_file")));
TransformerHandler handler =
((SAXTransformerFactory) transFact).newTransformerHandler(displayTemplate);

Storing html values in xml

I'm trying to figure out a way to extract specific information (name, description, id, etc.) from an HTML file, leave the unwanted information behind, and store the extracted values in an XML file.
I thought of using XSLT, since it can do XML to HTML, but it doesn't seem to work the other way around.
I honestly don't know what other language I should try to accomplish this. I know basic Java and JavaScript, but I'm not too sure whether they can do it. I'm kind of lost on getting this started.
I'm open to any advice or help, and I'm willing to learn a new language too, as I'm just doing this for fun.

There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.
Here's an example that shows parsing HTML using HTMLCleaner and then converting that object into a standard org.w3c.dom.Document:
TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);
In Jsoup, simply parse the input and serialize it into a string:
String text = Jsoup.parse("<html><div><p>test").outerHtml();
And convert that string into a W3C Document using one of the methods described here:
How to parse a String containing XML in Java and retrieve the value of the root node?
You can now use the standard JAXP interfaces to transform this document:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
Note: Provide some XSLT source to tFact.newTransformer() to do something more useful than the identity transform.

I would use HTMLAgilityPack or Chris Lovett's SGMLReader.
Or, simply HTML Tidy.

Ideally, you can treat your HTML as XML. If you're lucky, it will already be XHTML, and you can process it as XML. If not, use something like http://nekohtml.sourceforge.net/ (an HTML tag balancer, etc.) to process the HTML into something that is XML compliant so that you can use XSLT.
I have a specific example and some notes around doing this on my personal blog at http://blogger.ziesemer.com/2008/03/scraping-suns-bug-database.html.

TagSoup
JSoup
Beautiful Soup

Post-Process-Step for XSL

I'm currently working on a project which uses XSL-Transformations to generate HTML from XML.
On the input fields there are some attributes I have to set.
Sample:
<input name="/my/xpath/to/node"
class="{/my/xpath/to/node/@isValid}"
value="{/my/xpath/to/node}" />
This is pretty stupid because I have to write the same XPath three times. My idea was to have some kind of post-processor for the XSL file so I can write:
<input xpath="/my/xpath/to/node" />
I'm using something like this to transform my XML:
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;
import org.dom4j.Document;
import org.dom4j.io.DocumentResult;
import org.dom4j.io.DocumentSource;
public class Foo {
public Document styleDocument(
Document document,
String stylesheet
) throws Exception {
// load the transformer using JAXP
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
new StreamSource( stylesheet )
);
// now lets style the given document
DocumentSource source = new DocumentSource( document );
DocumentResult result = new DocumentResult();
transformer.transform( source, result );
// return the transformed document
Document transformedDoc = result.getDocument();
return transformedDoc;
}
}
My hope was that I can create a Transformer object out of a Document object. But it seems like it has to be a file path - at least I can't find a way to use a Document directly.
Does anyone know a way to achieve what I want?
Thanks

Why not skip the postprocessing, and use this in XSLT:
<xsl:variable name="myNode" select="/my/xpath/to/node" />
<input name="/my/xpath/to/node"
class="{$myNode/@isValid}"
value="{$myNode}" />
That gets you closer.
If you really want to be DRY (as apparently you do), you could even use a variable myNodePath whose value you generate from $myNode via a template or user-defined function. Does the name really have to be an XPath expression (as opposed to, say, a generate-id() value)?
Update:
Example code:
<xsl:variable name="myNode" select="/my/xpath/to/node" />
<xsl:variable name="myNodeName">
<xsl:apply-templates mode="generate-xpath" select="$myNode" />
</xsl:variable>
<input name="{$myNodeName}"
class="{$myNode/@isValid}"
value="{$myNode}" />
The template for generate-xpath mode is available on the web. For example, you can use one of the templates for that purpose that comes with Schematron. Go to this page, download iso-schematron-xslt1.zip, and look at iso_schematron_skeleton_for_xslt1.xsl. (If you're able to use XSLT 2.0, download the corresponding XSLT 2.0 archive instead.)
In there you'll find a couple of implementations of schematron-select-full-path, which you can use for generate-xpath. One version is precise and is best for consumption by a program; another is more human-readable. Remember, for any given node in an XML document, there are multitudes of XPath expressions that could be used to select only that node. So you probably won't be getting the same XPath expression that you came in with at the beginning. If this is a deal-breaker, you may want to try another approach, such as ...
generating your XSLT stylesheet (the one you've already been developing, call it A) with another XSLT stylesheet (call it B). When B generates A, B has the chance to output the XPath expression both as a quoted string, and as an expression that will be evaluated. This is basically preprocessing in XSLT instead of postprocessing in Java. I'm not really sure if it would work in your case. If I knew what the input XML looks like, it would be easier to figure that out I think.

My hope was that I can create a Transformer object out of a Document object. But it seems like it has to be a file path - at least I can't find a way to use a Document directly.
You can create a Transformer object from a document object:
Document stylesheetDoc = loadStylesheetDoc(stylesheet);
// load the transformer using JAXP
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
new DOMSource( stylesheetDoc )
);
Implementing loadStylesheetDoc is left as an exercise. You can build the stylesheet Document in memory or load it using JAXP, and you could even express the changes you need as another XSLT transform that is applied to the stylesheet itself.
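One possible shape for that exercise is sketched below, under a few assumptions: Document here means org.w3c.dom.Document (not the dom4j type used in the question), the stylesheet is passed as text rather than as a path, and the class name and sample stylesheet are invented. The detail that matters is that the parser must be namespace-aware, otherwise the xsl: elements are not recognized as XSLT instructions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StylesheetFromDom {
    // Hypothetical implementation: parses stylesheet text into a W3C DOM.
    static Document loadStylesheetDoc(String stylesheetText) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for the XSLT namespace to be seen
        return dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(stylesheetText)));
    }

    static String transform(String stylesheetText, String xml) throws Exception {
        Document stylesheetDoc = loadStylesheetDoc(stylesheetText);
        // The Document could be modified here before it is compiled.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new DOMSource(stylesheetDoc));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Invented sample stylesheet: emits one input element for /node.
        String xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output omit-xml-declaration='yes'/>"
          + "<xsl:template match='/'><input value='{/node}'/></xsl:template>"
          + "</xsl:stylesheet>";
        System.out.println(transform(xslt, "<node>v</node>"));
    }
}
```

Because the stylesheet is an ordinary DOM at that point, the rewriting of a shorthand attribute into the three full attributes could be done on stylesheetDoc before it is handed to newTransformer.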
