I have some PDF files generated from XSL-FO documents and I now need this content in HTML too. I am using FOP to create the PDF files, but it does not support HTML as an output format.
My question is this: Is there a Java library of some sort that can create HTML files from XSL-FO documents, or can I do this by throwing XSLT at it? Can I somehow extend FOP to create this type of output?
If XSLT is the only way to go, is there one already created? (I imagine I am not the first dude wanting this)
Thank you all!
You could use the FO2HTML stylesheet provided by RenderX to convert the XSL-FO into XHTML output. It converts <block> elements into <div>, <inline> into <span>, etc.
I have used it, and it works great.
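For what it's worth, applying a stylesheet like that from Java needs nothing beyond the JAXP classes shipped with the JDK. Below is a minimal sketch, not the RenderX stylesheet itself: STAND_IN_XSL is a tiny stand-in I made up that only maps fo:block to div and fo:inline to span, and the class and method names are hypothetical. In practice you would load the real stylesheet file with a StreamSource instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class FoToHtmlSketch {
    // Stand-in stylesheet (an assumption, NOT the RenderX FO2HTML stylesheet):
    // it only maps fo:block to div and fo:inline to span.
    static final String STAND_IN_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
      + " exclude-result-prefixes='fo'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='fo:block'><div><xsl:apply-templates/></div></xsl:template>"
      + "<xsl:template match='fo:inline'><span><xsl:apply-templates/></span></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given XSLT stylesheet to the given XSL-FO markup and returns the result.
    static String transform(String xsl, String fo) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(fo)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String fo = "<fo:block xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
                  + "Hello <fo:inline>world</fo:inline></fo:block>";
        System.out.println(transform(STAND_IN_XSL, fo));
    }
}
```

A real FO-to-HTML stylesheet has to deal with much more (tables, lists, page masters), which is why reusing an existing one is attractive.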
I have a requirement to convert multiple DOCX files into HTML format and, if possible, into RTF.
Docx4j seems to be a good Java library for doing this.
Using the HtmlExporterNG2.html method does not give me the desired result. So I am thinking of modifying the stylesheet that is extracted from the docx file and then using it for this conversion, since these docx files all have varying formatting and I therefore cannot use a standard stylesheet.
Am I correct in thinking that runtime tinkering with the stylesheet will work? And what are the important things I should be aware of?
I am using it as a standalone Java application with Java version 6.
My query might be a bit vague, but I am looking for the right direction at this juncture.
@Jason I want to ignore certain formatting in the input docx, as the converted HTML had some extra spacing, junk characters, etc. added to it.
As a solution I created a new XSLT. For the most part it is very similar to the one in the sample, but with a few minor tweaks. The new XSLT now converts the input docx file into properly formatted (as I need it) HTML for IE8, Mozilla or Chrome.
I am trying to convert a PDF document to a single HTML file in Java. Most of the converters online convert one PDF file to multiple HTML files. I want to convert the whole PDF to a single HTML file.
Any suggestions?
You could always write some code using the JSoup API to write a single document that incorporates the body of each of the multiple HTML files. Combining styles & style-sheets (CSS) might be a bit more tricky (especially if the original HTML uses 'id' attributes).
Though I find it hard to believe there is not a converter out there in which 'single document' is an option. I recommend searching further.
I think it should be possible to parse your PDF document with iText and then generate your HTML file.
I must admit I haven't checked if it is doable though.
Have you looked at http://www.jpedal.org/html_index.php which has an option to write to a single file?
I am doing a project wherein I need to read an HTML file and identify specific tags, modify the contents of the tag, and create a new HTML file. Is there a library that parses HTML tags and is capable of writing the tags back to a new file?
Check out http://jsoup.org. It has a friendly DOM-like API; for simple tasks you don't need to parse the HTML.
If you want to modify a web page and return the modified content, I think the best way is to use an XSL transformation.
http://en.wikipedia.org/wiki/XSLT
There are too many HTML parsers. You could use JTidy, NekoHTML or check TagSoup.
I usually prefer parsing XHTML with the standard Java XML Parsers, but you can't do this for any type of HTML.
Look at http://java-source.net/open-source/html-parsers for a list of java libraries that parse html files into java objects that can be manipulated.
If the html files you are working with are well formed (xhtml) then you can also use XML libraries in java to find particular tags and modify them. The IO itself should be handled by the particular libraries you are using.
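As a rough illustration of the well-formed (XHTML) route, here is a minimal sketch that uses only the XML classes bundled with the JDK. The class name, method name and sample markup are made up for the example, and it assumes the input has no DOCTYPE (otherwise the parser may try to fetch the DTD).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XhtmlRewriteSketch {
    // Replaces the content of every <tagName> element with newText
    // and returns the serialized document. Input must be well-formed XML.
    static String replaceTagText(String xhtml, String tagName, String newText) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            NodeList matches = doc.getElementsByTagName(tagName);
            for (int i = 0; i < matches.getLength(); i++) {
                // setTextContent drops any existing children of the element.
                matches.item(i).setTextContent(newText);
            }

            // An identity transformer serializes the modified DOM back to markup.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Old title</h1><p>keep me</p></body></html>";
        System.out.println(replaceTagText(xhtml, "h1", "New title"));
    }
}
```

To write the result to a new file, pass a StreamResult wrapping a File or an OutputStream instead of the StringWriter. For real-world HTML that is not well-formed, one of the lenient parsers mentioned above is the safer choice.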
If you choose to manually parse the strings you could use regular expressions to find particular tags and use the java io libraries to write to the files and create new html documents. But this method reinvents the wheel so to speak because you have to manage tag opening and closing and all of those things are handled by pre-existing libraries.
The basic Java API, which uses RTFEditorKit and HTMLEditorKit, is not able to recognize tags like <br/> and <table>.
So I searched the internet for a better way of converting HTML to RTF and found two solutions that seem to work:
JODConverter and HTML-to-RTF converter. The first one needs OpenOffice installed to work and the second one uses a DLL, so it can't be used on Linux.
Does anyone know of any other solution?
Thanks for any help!!!!
Do they want it in RTF or do they want it in Word format? There's a big difference.
Ensure your editor is generating XHTML (or convert it yourself with JTidy, htmlcleanup etc.), then serve the content as XHTML but with a .doc extension and the MS Word MIME type. Word 2003 or higher will open it as a Word document.
If it is valid html, you can use Apache-FOP.
There are stylesheets for transforming html to FO.
Apache FOP can write PDF and RTF as well.
http://www.torsten-horn.de/techdocs/java-xsl.htm#XSL-FO-Java
http://html2fo.sourceforge.net/index.html
You can take a look at RTF Template (http://rtftemplate.sourceforge.net/). I don't know if it fits your needs, but I have used it several times under Linux and it was OK.
I have already used the html-to-pdf and got the expected result. I hope this helps.
With RTF conversion there is an important issue to consider: the target RTF viewer. All of them declare RTF support, but, for instance, Notepad.exe can only show images in WMF format and does not display headers and footers; TextEdit on MacOS can only deal with images embedded as a kind of active object and has trouble with tables; OpenOffice is not tolerant of minor markup inconsistencies; and so on.
My favorite tool for HTML->RTF conversion is PD4ML - it produces clean, almost human-readable RTF markup and successfully solves another challenging problem for an RTF-generating tool: support for nested tables (if you work with HTML, they are everywhere).
I want to generate a text format file using XML and XSLT using Java.
I know how to generate PDF format, but I have no idea about generating text format, i.e. what packages are needed, what are the changes needed in XSLT?
If anybody can provide the sample for this it would be a great help for me.
You just need an:
<xsl:output method="text" omit-xml-declaration="yes" />
element, and then just output text from your templates. No package needed.
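To make that concrete, here is a minimal sketch using only the JAXP classes that ship with the JDK. The stylesheet, the sample XML and the class name are invented for illustration; only the method="text" output setting comes from the answer above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XmlToTextSketch {
    // Hypothetical stylesheet: prints one "name: value" line per <item>.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/items'>"
      + "<xsl:for-each select='item'>"
      + "<xsl:value-of select='@name'/><xsl:text>: </xsl:text>"
      + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over the XML and returns the plain-text result.
    static String toText(String xsl, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items><item name='a'>1</item><item name='b'>2</item></items>";
        System.out.print(toText(XSL, xml));
    }
}
```

To write a file instead of a string, hand the transformer a StreamResult built from a File or an OutputStream.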
David M shows how to get the raw text. However, you say you know how to generate PDF. Generating PDF directly from XSLT is a challenge. So perhaps the question means something else.
Are you using XSL FO or similar? In that case, IIRC, the Apache FOP allows generating formatted text as well as PDF (although perhaps not very well, not looked at it for ages). Other PDF generating tools may or may not also have a text output option.
If you run FOP from the command line:
fop -fo file.fo -txt file.txt
Or (if this is an embedded FOP)
Fop fop = fopFactory.newFop(MimeConstants.MIME_PLAIN_TEXT, out);