I am trying to convert a PDF document to a single HTML file in Java. Most of the converters I have found online convert one PDF file into multiple HTML files. I want to convert the whole PDF to a single HTML file.
Any suggestions?
You could always write some code using the JSoup API to build a single document that incorporates the body of each of the multiple HTML files. Combining styles and style sheets (CSS) might be a bit trickier (especially if the original HTML uses 'id' attributes).
Though I find it hard to believe there is not a converter out there in which 'single document' is an option. I recommend searching further.
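A minimal sketch of the merge idea from the answer above, not a definitive implementation. It swaps JSoup for the JDK's built-in DOM parser so that it runs without extra dependencies, which means it assumes every generated page is well-formed XHTML without a DOCTYPE (for real-world HTML, JSoup is the better fit). The class name HtmlPageMerger, the method merge and the "page-N" wrapper ids are hypothetical. Styles from the head sections and clashing 'id' attributes are not handled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: merges the <body> of several XHTML pages into one document.
final class HtmlPageMerger {

    static String merge(List<String> pages) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

            // Skeleton of the combined document: <html><body/></html>
            Document combined = builder.newDocument();
            Element html = combined.createElement("html");
            Element body = combined.createElement("body");
            combined.appendChild(html);
            html.appendChild(body);

            int pageNo = 0;
            for (String page : pages) {
                pageNo++;
                // Assumption: each page is well-formed XHTML with no DOCTYPE.
                Document doc = builder.parse(new InputSource(new StringReader(page)));
                Node srcBody = doc.getElementsByTagName("body").item(0);
                if (srcBody == null) {
                    continue; // nothing to copy from this page
                }
                // Wrap each page in its own <div> so pages stay distinguishable.
                Element wrapper = combined.createElement("div");
                wrapper.setAttribute("id", "page-" + pageNo);
                for (Node child = srcBody.getFirstChild(); child != null; child = child.getNextSibling()) {
                    wrapper.appendChild(combined.importNode(child, true)); // deep copy
                }
                body.appendChild(wrapper);
            }

            // Serialize the combined DOM back to a string.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(combined), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not merge pages", e);
        }
    }
}
```

Calling HtmlPageMerger.merge with the contents of page1.html, page2.html and so on (read in page order) returns one document whose body holds one div per original page.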
I think it should be possible to parse your PDF document with iText and then generate your HTML file.
I must admit I haven't checked if it is doable though.
Have you looked at http://www.jpedal.org/html_index.php? It has an option to write to a single file.
Related
I have a requirement to convert multiple DOCX files into HTML format and, if possible, into RTF.
Docx4j seems to be a good Java library for doing this.
The HtmlExporterNG2.html method does not always give me the desired result. So I am thinking of modifying the stylesheet that is extracted from the docx file and then using it for this conversion, because all these docx files have varying formatting and I therefore cannot use a standard stylesheet.
Am I correct in thinking that tinkering with the stylesheet at runtime will work? And what are the important things I should be aware of?
I am using it in a standalone Java application with Java version 6.
My query might be a bit vague, but I am looking for the right direction at this juncture.
@Jason I want to ignore certain formatting in the input docx, because the converted HTML had some extra spacing, junk characters, etc. added to it.
As a solution I created a new XSLT. For the most part it is very similar to the one in the sample, with a few minor tweaks. The new XSLT now converts the input docx file into HTML that is formatted the way I need for IE8, Mozilla or Chrome.
I am doing a project wherein I need to read an HTML file and identify specific tags, modify the contents of the tag, and create a new HTML file. Is there a library that parses HTML tags and is capable of writing the tags back to a new file?
Check out http://jsoup.org. It has a friendly DOM-like API, so for simple tasks you don't need to parse the HTML yourself.
If you want to modify a web page and return the modified content, I think the best way is to use an XSL transformation.
http://en.wikipedia.org/wiki/XSLT
There are a great many HTML parsers. You could use JTidy or NekoHTML, or check out TagSoup.
I usually prefer parsing XHTML with the standard Java XML parsers, but that does not work for arbitrary HTML.
Look at http://java-source.net/open-source/html-parsers for a list of Java libraries that parse HTML files into Java objects that can be manipulated.
If the HTML files you are working with are well formed (XHTML), then you can also use the XML libraries in Java to find particular tags and modify them. The I/O itself should be handled by the particular libraries you are using.
If you choose to parse the strings manually, you could use regular expressions to find particular tags and the Java I/O libraries to write the new HTML documents. But this method reinvents the wheel, so to speak, because you have to manage tag opening and closing yourself, and all of those things are already handled by existing libraries.
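As a rough sketch of the XML-library route described above (find particular tags, modify them, write the document back out), using only the parsers bundled with the JDK. It assumes the input is well-formed XHTML without a DOCTYPE. The class name XhtmlTagEditor and the method replaceText are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: replaces the text of every element with a given tag name.
final class XhtmlTagEditor {

    static String replaceText(String xhtml, String tagName, String newText) {
        try {
            // Parse the XHTML into a DOM tree (fails on HTML that is not well formed).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Find the tags of interest and modify their contents.
            NodeList matches = doc.getElementsByTagName(tagName);
            for (int i = 0; i < matches.getLength(); i++) {
                matches.item(i).setTextContent(newText);
            }

            // Write the modified tree back out; the library handles tag closing.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

To write to a new file instead of a string, pass new StreamResult(new File(...)) to the serializer. For HTML that is not well formed, one of the HTML parsers named in the other answers would be needed instead.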
I have some PDF files generated based on some XSL-FO documents and I now need this content in HTML too. I am using FOP for creating the PDF files but this does not support HTML as an output format.
My question is this: Is there a Java library of some sort that can create HTML files based on XSL-FO documents, or can I do this by throwing XSLT at it? Can I somehow extend FOP to create this type of output?
If XSLT is the only way to go, is there one already created? (I imagine I am not the first dude wanting this)
Thank you all!
You could use the Render-X provided FO2HTML stylesheet to convert the XSL-FO into XHTML output. It converts <block> elements into <div>, <inline> into <span>, etc.
I have used it, and it works great.
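To illustrate the kind of mapping involved, here is a tiny sketch that applies an XSLT to an XSL-FO document with the JDK's javax.xml.transform API. The embedded stylesheet is NOT the RenderX FO2HTML stylesheet; it is a hypothetical two-rule miniature that only turns fo:block into div and fo:inline into span and ignores all formatting properties. The class name FoToXhtmlSketch is likewise made up. For real documents you would load the actual stylesheet file instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: XSL-FO in, bare-bones XHTML out.
final class FoToXhtmlSketch {

    // Illustrative stylesheet only: block -> div, inline -> span, nothing else.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:fo=\"http://www.w3.org/1999/XSL/Format\""
        + " exclude-result-prefixes=\"fo\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<html><body><xsl:apply-templates select=\"//fo:flow\"/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match=\"fo:block\"><div><xsl:apply-templates/></div></xsl:template>"
        + "<xsl:template match=\"fo:inline\"><span><xsl:apply-templates/></span></xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String foDocument) {
        try {
            // Compile the stylesheet, then run it over the FO input.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(foDocument)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("XSL-FO to XHTML transformation failed", e);
        }
    }
}
```

Since the same FO source already feeds FOP for the PDF, running a second transformation like this over it gives the HTML without touching the PDF pipeline.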
The basic Java API, which uses RTFEditorKit and HTMLEditorKit, is not able to recognize tags like <br/> and <table>.
So I searched the internet for a better way of converting HTML to RTF, and I found two solutions that seem to work:
JODConverter and HTML-to-RTFconverter. The first one needs OpenOffice installed to work, and the second one uses a DLL, so it can't be used on Linux.
Does anyone know of another solution?
Thanks for any help!!!!
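For reference, the basic approach mentioned in the question can be sketched roughly like this with the Swing editor kits that ship with the JDK. This is only an illustration of the baseline, under the assumption that the HTML has no charset meta directive; as noted above, tags such as <br/> and <table> do not come through properly. The class name SwingHtmlToRtf is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical demo class: HTML -> Swing document model -> RTF.
final class SwingHtmlToRtf {

    static String convert(String html) {
        try {
            // Read the HTML into Swing's document model.
            HTMLEditorKit htmlKit = new HTMLEditorKit();
            Document doc = htmlKit.createDefaultDocument();
            htmlKit.read(new StringReader(html), doc, 0);

            // Write the same document model back out as RTF.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            new RTFEditorKit().write(out, doc, 0, doc.getLength());
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("HTML to RTF conversion failed", e);
        }
    }
}
```

This works on any platform with a JRE, but only simple character and paragraph content survives, which is why the answers below look at other tools.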
Do they want it in RTF or do they want it in Word format? There's a big difference.
Ensure your editor is generating XHTML (or convert it yourself with JTidy, htmlcleanup, etc.), then serve the content as XHTML but with a .doc extension and the MS Word MIME type. Word 2003 or higher will open it as a Word document.
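A small sketch of the two response headers this trick relies on. application/msword is the registered MIME type for .doc files; the helper name WordDownloadHeaders and its method are hypothetical, and how the headers are actually set depends on your server framework, which is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the HTTP headers that make a browser hand XHTML to Word.
final class WordDownloadHeaders {

    static Map<String, String> forFile(String baseName) {
        Map<String, String> headers = new LinkedHashMap<>();
        // MS Word MIME type, even though the body is really XHTML.
        headers.put("Content-Type", "application/msword");
        // Suggest a .doc file name so the download opens in Word.
        headers.put("Content-Disposition", "attachment; filename=\"" + baseName + ".doc\"");
        return headers;
    }
}
```

The response body is then simply the XHTML markup produced by the editor.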
If it is valid HTML, you can use Apache FOP.
There are stylesheets for transforming HTML to FO.
Apache FOP can write PDF and RTF as well.
http://www.torsten-horn.de/techdocs/java-xsl.htm#XSL-FO-Java
http://html2fo.sourceforge.net/index.html
You can take a look at RTF Template (http://rtftemplate.sourceforge.net/). I don't know if it fits your needs, but I have used it several times under Linux and it worked fine.
I have already used the html-to-pdf approach and got the expected result. I hope this helps.
With RTF conversion there is an important issue to watch out for: the target RTF viewer. All of them claim RTF support but, for instance, Notepad.exe can only show images in WMF format and does not display headers and footers. TextEdit on MacOS can only deal with images embedded as a kind of active object and has trouble with tables, OpenOffice does not tolerate minor markup inconsistencies, and so on.
My favorite tool for HTML->RTF conversion is PD4ML - it produces clean, almost human-readable RTF markup and successfully solves another challenging problem for an RTF generating tool: support for nested tables (if you work with HTML, they are everywhere).
I need to convert a PDF file to a DOC file. I have found various examples that generate a PDF file, but none that go from PDF to DOC.
What you're asking is actually very difficult.
I recommend you start here and look for a good parsing library. Then you would have to write the content out in .doc format. Inevitably, a lot of the formatting and extra information would be lost. It would be a lot easier to output to docx format, but I assume that's not what you're looking for.
I see a few possible solutions:
Davisor Publishor 6.2 can probably be used, but it is commercial and seems to generate only txt from PDF... just have a look.
Parse the PDF with iText and then generate the DOC with Apache POI - another way to try (a free one ;)
Look for command line tools, like Convert PDF To DOC, and execute them from Java.
Otherwise take a look at Con's answer; it links to a list of Java PDF processing libraries. Maybe some library can do it directly, or can be used to parse the PDF (better than iText), and then you can just use Apache POI to generate the DOC. Hope it helps ;)
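The "execute a command line tool from Java" option could look roughly like the sketch below, using ProcessBuilder from the JDK. The tool name and its argument order (tool, input PDF, output DOC) are purely hypothetical; check the documentation of whichever converter you pick. The class name ExternalPdfToDoc is also made up for this example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around an external PDF-to-DOC command line converter.
final class ExternalPdfToDoc {

    // Assumed CLI shape: <tool> <input.pdf> <output.doc>. Adjust for the real tool.
    static List<String> buildCommand(String tool, Path pdf, Path doc) {
        return Arrays.asList(tool, pdf.toString(), doc.toString());
    }

    // Runs the command, forwards its console output, and returns its exit code.
    static int run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .inheritIO() // show the tool's stdout/stderr in our console
                    .start();
            return process.waitFor();
        } catch (IOException e) {
            throw new IllegalStateException("Could not start " + command.get(0), e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for " + command.get(0), e);
        }
    }
}
```

Passing the arguments as a list (rather than one command string) avoids quoting problems with file names that contain spaces. An exit code of 0 conventionally means success, but that too depends on the tool.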