How to generate a text file using XSLT - java

I want to generate a text format file using XML and XSLT using Java.
I know how to generate PDF output, but I have no idea how to generate plain text: which packages are needed, and what has to change in the XSLT?
If anybody can provide a sample, it would be a great help.

You just need an:
<xsl:output method="text" omit-xml-declaration="yes" />
element, and then just output text from your templates. No package needed.
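For illustration, here is a minimal sketch of driving such a stylesheet from Java with only the JDK's built-in javax.xml.transform API. The class name, the sample XML, and the inline stylesheet are made up for this example; in practice you would probably load both from files and write the result to a FileWriter or a File.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltToText {
    // Hypothetical stylesheet: prints one person name per line as plain text.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/people\">"
      + "<xsl:for-each select=\"person\">"
      + "<xsl:value-of select=\"@name\"/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet to the given XML and returns the result as a string.
    static String transform(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input document.
        String xml = "<people><person name=\"Ada\"/><person name=\"Linus\"/></people>";
        System.out.print(transform(xml, XSLT));
    }
}
```

To write a file instead of a string, pass new StreamResult(new java.io.File("out.txt")) as the result; the file name here is only a placeholder.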

David M shows how to get the raw text. However, you say you know how to generate PDF. Generating PDF directly from XSLT is a challenge. So perhaps the question means something else.
Are you using XSL FO or similar? In that case, IIRC, the Apache FOP allows generating formatted text as well as PDF (although perhaps not very well, not looked at it for ages). Other PDF generating tools may or may not also have a text output option.

If you run FOP from the command line:
fop -fo file.fo -txt file.txt
Or, if FOP is embedded in your application:
Fop fop = fopFactory.newFop(MimeConstants.MIME_PLAIN_TEXT, out);

Related

Custom solution in Docx4J for converting Docx to HTML

I need to convert multiple DOCX files into HTML and, if possible, into RTF.
Docx4j seems to be a good Java library for doing this.
The HtmlExporterNG2.html method does not always give me the desired result, so I am thinking of modifying the stylesheet that is extracted from the docx file and then using it for the conversion. These docx files all have varying formatting, so I cannot use one standard stylesheet.
Am I correct in thinking that tinkering with the stylesheet at runtime will work? And what are the important things I should be aware of?
I am using it in a standalone Java application with Java version 6.
My question might be a bit vague, but I am looking for the right direction at this point.
@Jason I want to ignore certain formatting in the input docx, because the converted HTML had some extra spacing, junk characters, etc. added to it.
As a solution I created a new XSLT. For the most part it is very similar to the one in the sample, with a few minor tweaks. The new XSLT now converts the input docx file into properly formatted (as I need it) HTML for IE8, Mozilla or Chrome.

Convert PDF file to a single HTML file

I am trying to convert a PDF document to a single HTML file in Java. Most of the converters online convert one PDF file to multiple HTML files. I want to convert the whole PDF to a single HTML file.
Any suggestions?
You could always write some code using the JSoup API to build a single document that incorporates the body of each of the multiple HTML files. Combining styles and style sheets (CSS) might be a bit trickier (especially if the original HTML uses 'id' attributes).
Though I find it hard to believe there is not a converter out there in which 'single document' is an option. I recommend searching further.
I think it should be possible to parse your PDF document with iText and then generate your HTML file.
I must admit I haven't checked whether it is doable, though.
Have you looked at http://www.jpedal.org/html_index.php, which has an option to write to a single file?

Is there a library capable of generating XSL-FO from Office XML documents like DOCX, XLSX?

Does anyone know of a library that is capable of generating XSL-FO from a Microsoft Office Open XML file, such as a Word DOCX or an Excel XLSX?
Given that these Office files are basically XML in a ZIP file, I figure it would be pretty straightforward to generate XSL-FO from them by applying appropriate XSLT transformations, though writing the XSLT would take some time. But if it is as straightforward as I suspect, then maybe someone has written a library that does it, or released XSLT transformations that do it.
This Microsoft MSDN library article contains an example of creating XSL-FO with Word 2003 WordprocessingML files, but I haven't seen anything for the newer Open XML format.
Does anyone have suggestions? A Java library would be preferable, but anything would be considered.
docx4j has support for this, for docx; since v3.3.0 it is in a separate project https://github.com/plutext/docx4j-export-FO
It uses XSLT to create the XSL-FO. The XSLT uses Java extension functions to invoke docx4j methods to do much of the work, keeping the XSLT itself relatively simple.
docx4j uses FOP to convert XSL FO to PDF.
docx4j has support for xlsx, but no built in export from XLSX to XSL FO.
RenderX has a set of publicly available stylesheets that convert WordML into XSL-FO
http://www.renderx.com/tools/word2fo.html
These stylesheets were prepared by RenderX's development team and Microsoft for general use. They are used to convert documents in Microsoft's WordprocessingML XML vocabulary into documents in the W3C's XSL FO (XSLFO) vocabulary. These generic stylesheets produce XSL FO (XSLFO) suitable for RenderX XEP Engine.

generating HTML from XSL-FO using Java

I have some PDF files generated from XSL-FO documents, and I now need this content in HTML too. I am using FOP to create the PDF files, but it does not support HTML as an output format.
My question is this: is there a Java library of some sort that can create HTML files from XSL-FO documents, or can I do this by throwing XSLT at it? Can I somehow extend FOP to produce this type of output?
If XSLT is the only way to go, is there a stylesheet already written? (I imagine I am not the first person to want this.)
Thank you all!
You could use the FO2HTML stylesheet provided by RenderX to convert the XSL-FO into XHTML output. It converts <fo:block> elements into <div>, <fo:inline> into <span>, etc.
I have used it, and it works great.
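To illustrate the kind of mapping involved, here is a toy sketch using only the JDK's javax.xml.transform API. The inline stylesheet is NOT the RenderX FO2HTML stylesheet; it is a made-up two-rule example (block to div, inline to span) that ignores all formatting properties, and the class name and sample input are hypothetical as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FoToHtmlSketch {
    // Toy stylesheet (an assumption for illustration, not FO2HTML):
    // fo:block becomes div, fo:inline becomes span, text is copied by the built-in rules.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:fo=\"http://www.w3.org/1999/XSL/Format\""
      + " exclude-result-prefixes=\"fo\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"fo:block\"><div><xsl:apply-templates/></div></xsl:template>"
      + "<xsl:template match=\"fo:inline\"><span><xsl:apply-templates/></span></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the toy stylesheet over an XSL-FO fragment and returns the markup as a string.
    static String convert(String fo) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(fo)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical XSL-FO fragment.
        String fo = "<fo:block xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
                  + "Hello <fo:inline>world</fo:inline></fo:block>";
        System.out.println(convert(fo));
    }
}
```

A real stylesheet also has to translate FO properties (fonts, margins, tables, page masters) into CSS or HTML attributes, which is where most of the work lies.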

How to convert HTML ==> RTF in Java?

The basic Java API, which uses RTFEditorKit and HTMLEditorKit, cannot recognize tags like <br/> and <table>.
So I searched the internet for a better way of converting HTML to RTF, and I found two solutions that seem to work:
JODConverter and HTML-to-RTFconverter. The first one needs OpenOffice installed to work, and the second one uses a DLL, so it can't be used on Linux.
Does anyone know of another solution?
Thanks for any help!
Do they want it in RTF or do they want it in Word format? There's a big difference.
Ensure your editor is generating XHTML (or convert it yourself with jtidy, htmlcleanup etc.), then serve the content as XHTML but with a .doc extension and the MS Word MIME type. Word 2003 or higher will open it as a Word document.
If it is valid HTML, you can use Apache FOP.
There are stylesheets for transforming HTML to FO.
Apache FOP can write RTF as well as PDF.
http://www.torsten-horn.de/techdocs/java-xsl.htm#XSL-FO-Java
http://html2fo.sourceforge.net/index.html
You can take a look at RTF Template (http://rtftemplate.sourceforge.net/). I don't know if it fits your needs, but I have used it several times under Linux and it was OK.
I have already used the html-to-pdf and got the expected result. I hope this helps.
With RTF conversion there is an important issue to keep in mind: the target RTF viewer. All of them claim RTF support, but, for instance, Notepad.exe can only show images in WMF format and does not display headers and footers. TextEdit on macOS can only deal with images embedded as a kind of active object and has trouble with tables, OpenOffice is not tolerant of minor markup inconsistencies, etc.
My favorite tool for HTML->RTF conversion is PD4ML. It produces clean, almost human-readable RTF markup and successfully solves another challenging problem for an RTF-generating tool: support for nested tables (if you work with HTML, they are everywhere).
