HTML to XML Conversion using XSLT in Java

Hi, can anyone help me with HTML to XML conversion using XSLT in Java? I converted XML to HTML using XSLT in Java. This is the code I used for that conversion:
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class HowToXSLT {
    public static void main(String[] args) {
        // try-with-resources closes the output file even if the transform fails
        try (OutputStream out = new FileOutputStream("howto.html")) {
            TransformerFactory tFactory = TransformerFactory.newInstance();
            // Compile the stylesheet, then apply it to the XML input
            Transformer transformer = tFactory.newTransformer(new StreamSource("howto.xsl"));
            transformer.transform(new StreamSource("howto.xml"), new StreamResult(out));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
But I don't know how to do the reverse of this program, that is, how to convert HTML to XML. Are there any JAR files available to do that? Please help.

Generally, it isn't possible to "reverse" a transformation, because a transformation in the general case isn't a 1:1 mapping.
For example, if the transformation does this:
<xsl:value-of select= "/x * /x"/>
and we get as result: 16
(and we know that the source XML document had only one element),
it isn't possible to determine from the value 16 whether the source XML document was:
<x>4</x>
or whether it was:
<x>-4</x>
And the above was only a simple example! :)
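The point can be checked directly with the JDK's built-in XSLT processor. The sketch below is only an illustration: the class name NonInvertibleDemo and the helper square are made up for this example, and the stylesheet is the one-line expression above wrapped in a minimal template with text output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class NonInvertibleDemo {

    // The expression from the example above, wrapped in a minimal stylesheet.
    static final String SQUARE_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:value-of select='/x * /x'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML text and returns the text result.
    public static String square(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(SQUARE_XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two different source documents, one and the same result: 16
        System.out.println(square("<x>4</x>"));
        System.out.println(square("<x>-4</x>"));
    }
}
```

Since both inputs map to the same output, no stylesheet or library can recover the original document from the result alone.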

This will depend on what you wish to do exactly.
Apparently, howto.xsl contains the rules to be applied on the xml to get the html.
You will have to write another xsl file to do the reverse.

I believe it is not possible. XSLT input must be well-formed XML, and HTML is generally not well-formed XML (unless you are talking about XHTML).

Maybe you need to first make your HTML XHTML-compliant, then use an XSL file (the reverse of the original one) that contains the instructions to convert the XHTML file to XML.
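A minimal sketch of that second step, assuming the input is already well-formed XHTML in the XHTML namespace. The target element names (howto, title, step) and the class XhtmlToXmlDemo are hypothetical; a real reverse stylesheet has to be written by hand against whatever structure howto.xsl produces.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XhtmlToXmlDemo {

    // Hypothetical "reverse" stylesheet: maps the XHTML heading and paragraphs
    // to made-up <howto>/<title>/<step> elements.
    static final String REVERSE_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml' exclude-result-prefixes='h'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<howto><title><xsl:value-of select='//h:h1'/></title>"
      + "<xsl:for-each select='//h:p'><step><xsl:value-of select='.'/></step></xsl:for-each>"
      + "</howto></xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms well-formed XHTML text into the made-up XML vocabulary.
    public static String toXml(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(REVERSE_XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
            + "<h1>Java</h1><p>Install the JDK</p><p>Run javac</p></body></html>";
        System.out.println(toXml(xhtml));
    }
}
```

The stylesheet binds the XHTML namespace to the prefix h because unprefixed names in XPath 1.0 never match elements in a default namespace.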

It's not possible. You can use Microsoft.XMLDOM to convert from HTML to XML.

Related

check XML document structure

I'm parsing XML documents with Java. Every document has a root tag (whose name is an arbitrary string) and an unknown number of child tags containing text, as in the example below. The <AnyStrYouwant> tags have a string of characters in their body.
<anyRoot>
<AnyStrYouwant1>anyTextYouWant1</AnyStrYouwant1>
<AnyStrYouwant2>anyTextYouWant2</AnyStrYouwant2>
...
</anyRoot>
How can I check programmatically (in Java) whether a file matches this structure? I can parse XML, and I know that a DTD (for example) can check an XML file with a known format (tag names and content). What should I use in this case?
PS: Some people advise me to use XSD. But to validate elements I need to know the root element name, and I don't know it (every file has its own root element).
I can't comment with my new account, but yes, you can use DTD or Schematron.
Schematron is much more flexible and is an industry standard, whereas DTD is really a legacy technology that is still widely used. In short, a DTD checks which tags are allowed, whereas Schematron can check the structure of the file, for example that certain tags appear within the first 10 lines of the XML.
I would use DTD if you are only checking for existing tags and the allowed values of attributes.
If you need something more complex, I would recommend Schematron with its rule-based validation.
You can use DTD or XSD to validate XML; take a look at:
http://www.w3schools.com/xml/xml_dtd.asp
http://www.journaldev.com/895/how-to-validate-xml-against-xsd-in-java
XSD is the advanced technique to validate XML, it's more flexible than DTD but you can use one of those technologies to solve your problem.
You can check XML with XSD using this sample code.
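import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;

// Wrapper class added so that the snippet compiles; the name is arbitrary.
public class XmlValidator {
    // Returns true if the XML read from 'is' validates against the XSD file.
    public boolean isValidXML(InputStream is) {
        try (InputStream xsd = new FileInputStream("path/your-xsd-file.xsd")) {
            SAXSource sourceXSD = new SAXSource(new InputSource(xsd));
            SchemaFactory
                .newInstance("http://www.w3.org/2001/XMLSchema")
                .newSchema(sourceXSD).newValidator()
                .validate(new StreamSource(is));
        } catch (Exception e) {
            // Covers schema loading errors, I/O errors and validation failures
            return false;
        }
        return true;
    }
}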
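Because the root and child names are unknown in advance, a schema-free check over the parsed DOM is another option. This is not what the answers above propose; it is a minimal sketch under the assumption that "suits this structure" means: well formed, one root element of any name, and child elements of any name that contain only text. The class FlatStructureCheck and its methods are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FlatStructureCheck {

    // Returns true if the XML is well formed and its root element (any name)
    // contains only elements (any name) that themselves contain no elements.
    // Comments and processing instructions are ignored.
    public static boolean isFlat(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return false; // not well formed, or could not be read
        }
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Only whitespace is allowed directly inside the root
                if (!child.getNodeValue().trim().isEmpty()) return false;
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (hasElementChild(child)) return false; // nested element
            }
        }
        return true;
    }

    // True if the node has at least one element among its direct children.
    private static boolean hasElementChild(Node node) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isFlat("<anyRoot><a>text1</a><b>text2</b></anyRoot>"));
        System.out.println(isFlat("<anyRoot><a><nested/></a></anyRoot>"));
    }
}
```

The parser may print a diagnostic to standard error for input that is not well formed; the method still returns false in that case.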

How to support not well formed XHTML for XSLT

I've got arbitrary XHTML documents that are usually not well formed, since websites can be written that way and browsers will still display them. How can I support XSLT transformation of XHTML that is not well formed? Is there a way to skip the parts that are not well formed?
I have this code in Java, but as I said, it doesn't support XHTML that is not well formed:
try {
TransformerFactory tFactory=TransformerFactory.newInstance();
Source xslDoc=new StreamSource("path1");
Source xmlDoc=new StreamSource("path2");
String outputFileName="path3";
OutputStream htmlFile=new FileOutputStream(outputFileName);
Transformer trasform=tFactory.newTransformer(xslDoc);
trasform.transform(xmlDoc, new StreamResult(htmlFile));
}
catch (Exception e) {...}
You can use JSoup library to parse and fix your HTML and then use XSLT.
You can try to use an HTML parser like http://about.validator.nu/htmlparser/ or like TagSoup.

Saxon in Java: XSLT for CSV to XML

Mostly continued from this question: XSLT: CSV (or Flat File, or Plain Text) to XML
So, I have an XSLT from here: http://andrewjwelch.com/code/xslt/csv/csv-to-xml_v2.html
And it converts a CSV file to an XML document. It does this when used with the following command on the command line:
java -jar saxon9he.jar -xsl:csv-to-xml.xsl -it:main -o:output.xml
So now the question becomes: how do I do this in my Java code?
Right now I have code that looks like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
StreamSource xsltSource = new StreamSource(new File("location/of/csv-to-xml.xsl"));
Transformer transformer = transformerFactory.newTransformer(xsltSource);
StringWriter stringWriter = new StringWriter();
transformer.transform(documentSource, new StreamResult(stringWriter));
String transformedDocument = stringWriter.toString().trim();
(The Transformer is an instance of net.sf.saxon.Controller.)
The trick on the command line is to specify "-it:main" to point right at the named template in the XSLT. This means you don't have to provide the source file with the "-s" flag.
The problem starts again on the Java side. Where/how would I specify this "-it:main"? Wouldn't doing so break other XSLT's that don't need that specified? Would I have to name every template in every XSLT file "main?" Given the method signature of Transformer.transform(), I have to specify the source file, so doesn't that defeat all the progress I've made in figuring this thing out?
Edit: I found the s9api hidden inside the saxon9he.jar, if anyone is looking for it.
You are using the JAXP API, which was designed for XSLT 1.0. If you want to make use of XSLT 2.0 features, like the ability to start a transformation at a named template, I would recommend using the s9api interface instead, which is much better designed for this purpose.
However, if you've got a lot of existing JAXP code and you don't want to rewrite it, you can usually achieve what you want by downcasting the JAXP objects to the underlying Saxon implementation classes. For example, you can cast the JAXP Transformer as net.sf.saxon.Controller, and that gives you access to controller.setInitialTemplate(); when it comes to calling the transform() method, just supply null as the Source parameter.
Incidentally, if you're writing code that requires a 2.0 processor then I wouldn't use TransformerFactory.newInstance(), which will give you any old XSLT processor that it finds on the classpath. Use new net.sf.saxon.TransformerFactoryImpl() instead, which (a) is more robust, and (b) much much faster.

Storing html values in xml

I'm trying to figure out a way to extract specific information (name, description, id, etc.) from an HTML file, leave the unwanted information behind, and store the extracted values in an XML file.
I thought of using XSLT, since it can do XML to HTML, but it doesn't seem to work the other way around.
I honestly don't know what other language I should try to accomplish this. I know basic Java and JavaScript, but I'm not sure whether they can do it. I'm kind of lost on getting this started.
I'm open to any advice or help, and willing to learn a new language too, as I'm just doing this for fun.
There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.
Here's an example that shows parsing HTML using HTMLCleaner and then converting that object into a standard org.w3c.dom.Document:
TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);
In Jsoup, simply parse the input and serialize it into a string:
String text = Jsoup.parse("<html><div><p>test").outerHtml();
And convert that string into a W3C Document using one of the methods described here:
How to parse a String containing XML in Java and retrieve the value of the root node?
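For reference, one common way to do that conversion with the JDK alone is sketched below. It assumes the string is well-formed XML; an HTML serialization is not guaranteed to be (unclosed void tags such as <br> are a typical problem), so the serializer must be set up to emit XML syntax. The class name StringToDom is made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class StringToDom {

    // Parses well-formed XML text into a namespace-aware W3C DOM Document.
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed if the DOM is later fed to XSLT
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><div><p>test</p></div></html>");
        System.out.println(doc.getDocumentElement().getNodeName()); // prints html
    }
}
```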
You can now use the standard JAXP interfaces to transform this document:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
Note: Provide some XSLT source to tFact.newTransformer() to do something more useful than the identity transform.
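As a sketch of what such an XSLT source could look like for the question's goal (keeping only name, description and id), the example below applies a small stylesheet to a DOM Document. The HTML markup, the class names item/name/description, the output vocabulary and the class ExtractFieldsDemo are all hypothetical; the parse helper only stands in for the output of an HTML cleaner.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ExtractFieldsDemo {

    // Hypothetical stylesheet: keeps only id, name and description of each item.
    static final String EXTRACT_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'><items>"
      + "<xsl:for-each select='//div[@class=\"item\"]'>"
      + "<item id='{@id}'>"
      + "<name><xsl:value-of select='span[@class=\"name\"]'/></name>"
      + "<description><xsl:value-of select='span[@class=\"description\"]'/></description>"
      + "</item></xsl:for-each>"
      + "</items></xsl:template></xsl:stylesheet>";

    // Stand-in for the cleaned-up document an HTML parser would produce.
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs the extraction stylesheet over the DOM and returns the XML text.
    public static String extract(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(EXTRACT_XSL)));
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>ignore me</p>"
            + "<div class='item' id='42'><span class='name'>Widget</span>"
            + "<span class='description'>A small part</span></div></body></html>");
        System.out.println(extract(doc));
    }
}
```

Everything the stylesheet does not select, such as the paragraph in the sample input, is simply left out of the result.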
I would use HTMLAgilityPack or Chris Lovett's SGMLReader.
Or, simply HTML Tidy.
Ideally, you can treat your HTML as XML. If you're lucky, it will already be XHTML, and you can process it as XML. If not, use something like http://nekohtml.sourceforge.net/ (an HTML tag balancer, etc.) to process the HTML into something that is XML compliant so that you can use XSLT.
I have a specific example and some notes around doing this on my personal blog at http://blogger.ziesemer.com/2008/03/scraping-suns-bug-database.html.
TagSoup
JSoup
Beautiful Soup

Post-Process-Step for XSL

I'm currently working on a project which uses XSL-Transformations to generate HTML from XML.
On the input fields there are some attributes I have to set.
Sample:
<input name="/my/xpath/to/node"
class="{/my/xpath/to/node/@isValid}"
value="{/my/xpath/to/node}" />
This is pretty stupid because I have to write the same XPath 3 times... My idea was to have some kind of post-processor for the XSL file so I can write:
<input xpath="/my/xpath/to/node" />
I'm using something like this to transform my XML:
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;
import org.dom4j.Document;
import org.dom4j.io.DocumentResult;
import org.dom4j.io.DocumentSource;

public class Foo {
    public Document styleDocument(
        Document document,
        String stylesheet
    ) throws Exception {
        // load the transformer using JAXP
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(
            new StreamSource( stylesheet )
        );
        // now let's style the given document
        DocumentSource source = new DocumentSource( document );
        DocumentResult result = new DocumentResult();
        transformer.transform( source, result );
        // return the transformed document
        Document transformedDoc = result.getDocument();
        return transformedDoc;
    }
}
My hope was that I can create a Transformer object out of a Document object. But it seems like it has to be a file path - at least I can't find a way to use a Document directly.
Anyone knows a way to achieve what I want?
Thanks
Why not skip the postprocessing, and use this in XSLT:
<xsl:variable name="myNode" select="/my/xpath/to/node" />
<input name="/my/xpath/to/node"
class="{$myNode/@isValid}"
value="{$myNode}" />
That gets you closer.
If you really want to DRY (as apparently you do), you could even use a variable myNodePath for which you generate the value from $myNode via a template or user-defined function. Does the name really have to be an XPath expression (as opposed to a generate-id()?)
Update:
Example code:
<xsl:variable name="myNode" select="/my/xpath/to/node" />
<xsl:variable name="myNodeName">
<xsl:apply-templates mode="generate-xpath" select="$myNode" />
</xsl:variable>
<input name="{$myNodeName}"
class="{$myNode/@isValid}"
value="{$myNode}" />
The template for generate-xpath mode is available on the web... For example, you can use one of the templates for that purpose that comes with Schematron. Go to this page, download iso-schematron-xslt1.zip, and look at iso_schematron_skeleton_for_xslt1.xsl. (If you're able to use XSLT 2.0, then download that zip archive.)
In there you'll find a couple of implementations of schematron-select-full-path, which you can use for generate-xpath. One version is precise and is best for consumption by a program; another is more human-readable. Remember, for any given node in an XML document, there are multitudes of XPath expressions that could be used to select only that node. So you probably won't be getting the same XPath expression that you came in with at the beginning. If this is a deal-breaker, you may want to try another approach, such as ...
generating your XSLT stylesheet (the one you've already been developing, call it A) with another XSLT stylesheet (call it B). When B generates A, B has the chance to output the XPath expression both as a quoted string, and as an expression that will be evaluated. This is basically preprocessing in XSLT instead of postprocessing in Java. I'm not really sure if it would work in your case. If I knew what the input XML looks like, it would be easier to figure that out I think.
My hope was that I can create a Transformer object out of a Document object. But it seems like it has to be a file path - at least I can't find a way to use a Document directly.
You can create a Transformer object from a document object:
Document stylesheetDoc = loadStylesheetDoc(stylesheet);
// load the transformer using JAXP
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
new DOMSource( stylesheetDoc )
);
Implementing loadStylesheetDoc is left as an exercise. You can build the stylesheet Document internally or load it using JAXP, and you could even write the changes you need as another XSLT transform that transforms the stylesheet.
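One possible shape for that exercise, using only the JDK: load the stylesheet into a namespace-aware org.w3c.dom.Document, expand the xpath shorthand from the question in the DOM, and then compile the modified document. This is a sketch, not the answerer's implementation. It assumes the attribute holding the validity flag is @isValid, that the shorthand appears only on input elements, and that loadStylesheetDoc receives the stylesheet text rather than a path; the class StylesheetPreprocessor and the method expandXPathShorthand are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StylesheetPreprocessor {

    // Demo stylesheet using the shorthand <input xpath="..."/> from the question.
    public static final String DEMO_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'><input xpath='/form/field'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Parses stylesheet text into a DOM; namespace awareness is required,
    // otherwise the xsl: elements are not recognised as XSLT instructions.
    static Document loadStylesheetDoc(String xslt) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xslt)));
    }

    // Rewrites <input xpath="P"/> into name="P" class="{P/@isValid}" value="{P}".
    static void expandXPathShorthand(Document stylesheetDoc) {
        NodeList inputs = stylesheetDoc.getElementsByTagName("input");
        for (int i = 0; i < inputs.getLength(); i++) {
            Element input = (Element) inputs.item(i);
            if (!input.hasAttribute("xpath")) continue;
            String path = input.getAttribute("xpath");
            input.removeAttribute("xpath");
            input.setAttributeNS(null, "name", path);
            input.setAttributeNS(null, "class", "{" + path + "/@isValid}");
            input.setAttributeNS(null, "value", "{" + path + "}");
        }
    }

    // Preprocesses the stylesheet, compiles it from the DOM and applies it.
    public static String transform(String xslt, String xml) {
        try {
            Document stylesheetDoc = loadStylesheetDoc(xslt);
            expandXPathShorthand(stylesheetDoc);
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new DOMSource(stylesheetDoc));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(DEMO_XSL, "<form><field isValid='ok'>hello</field></form>"));
    }
}
```

The rewrite happens before compilation, so the braces end up as ordinary attribute value templates and the XPath only has to be written once in the stylesheet source. The order of the generated attributes in the output is not guaranteed.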
