I have an XLSX file and I need to convert it to an equivalent PDF. I tried many approaches, from XLSX to HTML to PDF, using online converters and libraries (including iText, Flying Saucer, and Apache POI), but none of them worked well, since the text formatting gets lost.
I thought of another solution: use LibreOffice to print it as a PDF. I saw online that this should be possible, but I did not understand how. I know I have to install it, and that's OK... but then what?
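For reference, one commonly suggested route is LibreOffice's headless mode, driven from Java through a child process. The following is only a minimal sketch under stated assumptions: the `soffice` executable is installed and on the PATH, and the class and method names (`LibreOfficePdfExport`, `buildCommand`, `convert`) are made up for illustration. How faithfully the formatting survives depends on the spreadsheet and the LibreOffice version.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of any library.
class LibreOfficePdfExport {

    // Builds: soffice --headless --convert-to pdf --outdir <outputDir> <input>
    // Assumes "soffice" can be found on the PATH.
    static List<String> buildCommand(Path input, Path outputDir) {
        return List.of("soffice", "--headless", "--convert-to", "pdf",
                "--outdir", outputDir.toString(), input.toString());
    }

    // Runs the conversion and returns the exit code of the soffice process.
    static int convert(Path input, Path outputDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outputDir))
                .inheritIO() // forward LibreOffice's messages to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: LibreOfficePdfExport <input.xlsx> <output-dir>");
            return;
        }
        int exitCode = convert(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("soffice exited with code " + exitCode);
    }
}
```

LibreOffice writes the PDF into the output directory under the input's base name, so `report.xlsx` would become `report.pdf`.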
Related
How to extract PDF file content in Java completely as text and render it as HTML?
The goal is not to extract just the text or just the images separately. The requirement is to display the contents of the PDF file as HTML the way they appear in the original file, with images and tables in the same place as in the original.
Somewhat like the sample in the answer here, Convert Word to HTML with Apache POI, which extracts the contents of an MS Word document into HTML using Apache POI.
Extracting data from a PDF file is fairly simple. There are multiple libraries out there that do it correctly. Extracting data while preserving its layout, on the other hand (the workflow the OP describes), is a very difficult process. The reason is simple: most PDF files don't have any elements that define structure. When a PDF file displays a table, for example, it's very easy for humans to see it and understand that it is indeed a table with some data in it. However, in the PDF file itself, it is just a collection of vector lines with some text runs in between. Neither the PDF itself nor the PDF viewer is aware that this is a table. Therefore, when this data is converted to HTML, we don't know that we need to draw a table; we only see vector art. This is just one example of why this is difficult. There are many others that could be used to illustrate the point.
On the other hand, there is such a thing as "Tagged PDF" (section 10.7 of the PDF Reference). It's a PDF where structure elements are actually defined, and extraction is fairly easy. However, tagged PDF files are not as common as we would like, and in most cases you have no guarantee of working with one.
There are some tools on the market that use sophisticated logic to infer the structure of an untagged document. Some of them do a better job than others at this. I've worked with Adobe Acrobat, which does a decent job at creating an HTML file. There is also an offering from Datalogics (I work for Datalogics) called PDF Alchemist which converts PDF to HTML. Both of them are commercial solutions.
If you are looking for a free solution, PDFBox does a good job of extracting content from a PDF document. However, it doesn't have the ability to create an HTML file, so that part has to be implemented outside of the library. I'm not aware of any free PDF-to-HTML solution that does a good enough job for me to recommend it.
I want to make a simple word counter for my LaTeX documents so that I can double-check that my word count is accurate. More generally, it would be useful to know whether Java can read text from PDF files at all. A Google search brought nothing up, so I am thinking maybe not? If not, why?
You can't read text from a .pdf without a PDF file reader. Here are a couple of Java PDF libraries:
Apache PDFBox
iText
See also this link for an example of Java text extraction with PDFBox:
http://pdfbox.apache.org/userguide/text_extraction.html
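Once one of these libraries has handed you the text as a String, the counting itself needs nothing beyond the standard library. A minimal sketch, assuming a "word" is a run of letters or digits, optionally joined by apostrophes or hyphens (the `WordCounter` class is hypothetical, and other definitions of a word will give other totals):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts words in text already extracted from a PDF.
class WordCounter {

    // Letters or digits, optionally joined by ' or - (so "doesn't" and
    // "well-known" each count as one word).
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*");

    static int countWords(String text) {
        Matcher matcher = WORD.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for text returned by a PDF library.
        String extracted = "The quick brown fox doesn't jump over well-known dogs.";
        System.out.println(countWords(extracted)); // prints 9
    }
}
```

Keep in mind that text extracted from a PDF can include page numbers, headers, and words split by end-of-line hyphenation, so the total may differ somewhat from a count taken on the LaTeX source.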
I am new to Java. I have to read a PDF, OpenOffice, or MS Word file, make changes to it, and render it as a PDF document on my web page. Can someone tell me which of these formats has an API or SDK that is easy to use, and which SDK is best for this, so that I can read, update, and render easily? The file also contains tables, but no images.
We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.
Depending on the types of updates you are doing, modifying a PDF is going to be a problem, since the format is not intended for editing. You might have to find some way of converting the PDF to something else first and then edit that. Depending on the types of changes you want to make and the documents you are working from, even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping them and modifying the XML, or by using the UNO API. Word documents can be edited by using MS Office automation (bad idea), by converting to OpenOffice first and then editing, or, if DOCX, by unzipping and processing the XML.
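To illustrate the "unzip and process the XML" route: both .docx and .odt files are ordinary ZIP archives, so the standard `java.util.zip` classes can open them. A minimal sketch, assuming the body lives in the conventional entry (`word/document.xml` for DOCX, `content.xml` for ODT); the `DocumentXmlReader` class is hypothetical, and writing a modified archive back is not shown:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: pulls the main XML part out of a .docx or .odt file.
class DocumentXmlReader {

    static String readMainXml(Path file) throws IOException {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        // Conventional locations of the document body inside the archive.
        String entryName = name.endsWith(".docx") ? "word/document.xml" : "content.xml";
        try (ZipFile zip = new ZipFile(file.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No " + entryName + " entry in " + file);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DocumentXmlReader <file.docx|file.odt>");
            return;
        }
        System.out.println(readMainXml(Paths.get(args[0])));
    }
}
```

The returned string can then be fed to any XML parser. Editing means changing that XML and writing a new archive that contains the changed entry alongside all the untouched ones.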
Good luck.
Is there any free Java library for extracting text from PDF that is compatible with Google App Engine?
I've read about PDFJet, but it can't read PDF, can it?
Is there perhaps another way to extract text from PDF? I tried http://www.pdfdownload.org/, but unfortunately they don't handle non-English characters correctly.
iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.
PDFBox does not run on GAE. It uses Java classes that are not allowed there.
(GAE only permits these: http://code.google.com/appengine/docs/java/jrewhitelist.html)
I have partially modified a very old version of PDFBox (0.7.3) to be GAE compliant. Now I'm able to extract text from a PDF (a whole page or a rectangular area). I only modified the minimum needed for PDF text extraction, not the whole of PDFBox. :)
The idea was to remove references to java.awt.Rectangle and similar classes by using my own "rectangle" class.
More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html
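To give an idea of what such a replacement can look like, here is a minimal sketch of an AWT-free rectangle with just enough behaviour to test whether a text position falls inside an extraction region. This is not the code from the patch described above; the `PlainRectangle` name and its fields are assumptions made for illustration.

```java
// Hypothetical stand-in for java.awt.Rectangle that uses no AWT classes.
final class PlainRectangle {
    final float x;
    final float y;
    final float width;
    final float height;

    PlainRectangle(float x, float y, float width, float height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // True if the point lies inside the rectangle; the left and top edges
    // are inclusive, the right and bottom edges are exclusive.
    boolean contains(float px, float py) {
        return px >= x && px < x + width && py >= y && py < y + height;
    }

    public static void main(String[] args) {
        PlainRectangle region = new PlainRectangle(10, 10, 100, 50);
        System.out.println(region.contains(20, 20));  // prints true
        System.out.println(region.contains(200, 20)); // prints false
    }
}
```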
I modified the latest (1.8.0-Snapshot) version to run on Google App Engine. I had to disable one unit test, but it runs fine for simple text extraction.
Following the simple try-fail-fix approach, I had to modify 5 files in total. Pretty doable.
You'll also have to explicitly use a RandomAccessBuffer, like Fabrizio explained.
For the extra lazy, here's the compiled jar, the dependencies for text extraction, and the patch. Note that it might not work for every use case (e.g. rectangle-based extraction). I used it to extract the text of a whole page.
https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit
I know there is http://pdfbox.apache.org/index.html
"Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents."
but I've never tested it.
Last month I finished extracting text from PDF files in my project. I used the Xpdf tool to get the text and the text coordinates, but I used it from Xcode (Objective-C). The tool is open source, written in C++, and can be used from many languages. However, I don't know whether Xpdf would work from Java. Anyway, you can try this tool.
I'm developing a web application using Flex and JSP.
I am having some performance issues with displaying multiple PDF files.
I am trying to display about 50-100 PDF files. I know that is a little crazy.
Hence, I made the project convert the PDF files to JPG format and display the JPG files instead.
I'm wondering if there is a way to reduce the file size of a PDF to that of a JPG.
I am also looking for other ways to improve performance.
Does anyone know a good way to display many PDF files (that will be mostly just text) in a web application? Or should I just have it display JPG files?
If the PDF files are mostly text you should probably use HTML. Is there something that would prevent you from making regular pages from your PDFs?
You can convert the PDF to an RTF text file, then use the text from the RTF file to populate your HTML page, perhaps in a table.
Check out the Ghostscript library for doing this conversion.
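A minimal sketch of driving Ghostscript from Java follows. It rests on several assumptions: the `gs` executable is on the PATH (on Windows it is usually named differently), the class and method names are made up, and the output is plain text via Ghostscript's `txtwrite` device rather than RTF, since I am not aware of an RTF output device in Ghostscript.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of Ghostscript.
class GhostscriptTextExport {

    // Builds: gs -q -sDEVICE=txtwrite -o <textOut> <pdf>
    // Assumes "gs" can be found on the PATH.
    static List<String> buildCommand(Path pdf, Path textOut) {
        return List.of("gs", "-q", "-sDEVICE=txtwrite",
                "-o", textOut.toString(), pdf.toString());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 2) {
            System.err.println("usage: GhostscriptTextExport <input.pdf> <output.txt>");
            return;
        }
        Process process = new ProcessBuilder(
                buildCommand(Paths.get(args[0]), Paths.get(args[1])))
                .inheritIO() // forward Ghostscript's messages to this console
                .start();
        System.out.println("gs exited with code " + process.waitFor());
    }
}
```

The resulting text file can then be read and escaped into the HTML page on the server side.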