Extracting the outline (or bookmarks) from PDF files using Java - java

I'm using PDFBox to extract the outline (bookmark) information from PDF files, which is even explained on the PDFBox site.
However, my problem is not extracting the outline but generating the qualified URLs (foo.pdf#page=22777&zoom=2,2,777) that open the PDF at those bookmarks. Sometimes PDFBox cannot find the page on which the bookmark is placed (i.e. the page number, left coordinate or top coordinate is wrong).
Does anyone know a PDF library capable of doing this (preferably in Java)? Thanks.
Best regards,
Alexander.

iText (http://itextpdf.com) might work for you.
I've used it mostly to create PDFs (not so much for parsing already existing ones), but the library is good, and it does have objects related to outlines and bookmarks.
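Whichever library ends up reading the outline, the qualified URL from the question can be assembled with the JDK alone. Below is a minimal sketch, assuming the widely used viewer open parameters `page=` (one-based) and `zoom=scale,left,top` (scale in percent). The class name `PdfOpenUrl`, the method `build` and its parameters are hypothetical names made up for this illustration; they are not part of PDFBox or iText.

```java
import java.util.Locale;

/** Hypothetical helper: builds URLs of the form file.pdf#page=N&zoom=scale,left,top. */
final class PdfOpenUrl {

    private PdfOpenUrl() {}

    /**
     * @param pdfUrl      location of the PDF, without a fragment
     * @param pageIndex   zero-based page index (assumption: the index as a library reports it)
     * @param zoomPercent zoom scale in percent (100 = actual size)
     * @param left        horizontal scroll offset
     * @param top         vertical scroll offset
     */
    static String build(String pdfUrl, int pageIndex, int zoomPercent, int left, int top) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("pageIndex must not be negative: " + pageIndex);
        }
        // The page= parameter counts from 1, so shift the zero-based index.
        int pageNumber = pageIndex + 1;
        return String.format(Locale.ROOT, "%s#page=%d&zoom=%d,%d,%d",
                pdfUrl, pageNumber, zoomPercent, left, top);
    }

    public static void main(String[] args) {
        // Prints: foo.pdf#page=22&zoom=100,0,777
        System.out.println(build("foo.pdf", 21, 100, 0, 777));
    }
}
```

An off-by-one page number is one possible cause of links that open on the wrong page, which is why the index shift is explicit here. Bookmark destinations inside a PDF usually measure the top coordinate from the bottom of the page, while the viewer parameter measures from the top-left corner, so a conversion may also be needed; that step is not shown.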

Related

Converting docx/ODT to image using Java

I am working on a web application based on Spring REST services (the UI is based on HTML5 and backbone.js). The requirement is that an uploaded document (which could be any document: Excel, Word, PPT, PDF, etc.) needs a preview option that lets a user view the document in the browser (the user may or may not have Office installed).
My idea is to convert the documents into images and display them to the user. While searching, I found multiple ways to convert a PDF to an image, but not many to convert ODT to an image (note: I am looking for an open-source solution). JODConverter or docx4j can be used to convert the documents to PDF, and then I can convert these PDFs to images. But is this the right way? Is there a more efficient way to achieve the same result? Please point me in the right direction.
Thanks in advance.
Gopi
Yes, you won't do any better than .docx to .pdf to image. You really need a stable workflow, and this is as good as you'll find for this purpose, unless you're running on a Microsoft server and you have access to the official Microsoft Office stuff.
For previews, docx4j or similar will do just fine. Not everything converts perfectly, but it should be fine for a preview.

How to open, jump to a page, change page size (basic read only operations of PDF)

I am a complete beginner with iText. I just want to know how to do basic things with a PDF from Java:
How do I open a PDF so that the file is displayed in front of me?
How do I jump to a page (by page number) so that I see the requested page?
How do I change the page size so that I see the change on screen?
I know these things can be done with the mouse and keyboard, but I want to write a program that opens PDF files according to parameters.
When I searched for iText, I only found topics about creating or modifying PDFs.
You have misunderstood what iText is for. It's not a library for PDF rendering, so you won't be able to display a PDF file with iText.
As for your issue, most PDF viewers (Adobe Acrobat, Foxit Reader) accept parameters that achieve (partly) what you want.
It's just a matter of finding the proper reader. Remember that asking for advice about libraries, applications, or frameworks is considered off-topic on Stack Overflow, so your question is likely to be closed.
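As a rough illustration of passing parameters to a viewer, here is a minimal sketch that assumes Adobe Reader's documented `/A` command-line switch, which takes open parameters such as `page=` and `zoom=`. The viewer path is supplied by the caller, and the class name `ViewerCommand` and method `acrobatCommand` are hypothetical. Other readers use different switches, so check the documentation of the one you pick.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds a command line that opens a PDF at a given page and zoom. */
final class ViewerCommand {

    private ViewerCommand() {}

    /**
     * @param viewerExe   path to the viewer executable (assumption: an Adobe Reader install)
     * @param pdfPath     path of the PDF to open
     * @param pageNumber  one-based page number to jump to
     * @param zoomPercent zoom scale in percent
     */
    static List<String> acrobatCommand(String viewerExe, String pdfPath,
                                       int pageNumber, int zoomPercent) {
        List<String> command = new ArrayList<>();
        command.add(viewerExe);
        command.add("/A"); // switch that introduces the open parameters
        command.add("page=" + pageNumber + "&zoom=" + zoomPercent);
        command.add(pdfPath);
        return command;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: ViewerCommand <viewer-exe> <pdf-file>");
            return;
        }
        // Launches the viewer on page 5 at 100% zoom.
        new ProcessBuilder(acrobatCommand(args[0], args[1], 5, 100)).start();
    }
}
```

Passing the arguments as a list to `ProcessBuilder` avoids quoting problems with paths that contain spaces.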

filtering pdf links from html source code

I'm about to write a class that looks at HTML source code and filters all PDF links from it. The idea is simply to combine the parent link with the relative link.
Basically it works for
pdf
but in some cases it doesn't, e.g. if the same PDF link is written as
pdf
or
pdf
(dot and space). Both are working links and go to the same PDF in the same directory when a browser parses them, but they are useless for the composition in my class.
I fixed the problem for the two cases above. The question is whether there are other special cases in the syntax that I should pay attention to.
You do not know what the link points to until you download the file.
I can have a link like http://www.mysite.com/pages/brochure.html which internally redirects to a PDF file.
So, if you're not in control of the links, or working on a particular section of your site, you're going to fail.
On the other hand, if you're working on a specific section of the site, where you know every PDF link has a .pdf extension, you can simply check the extension and not the whole path (I don't know how C#'s .lastIndexOf("string") is written in Java).
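For the Java side of this, a minimal sketch using only the JDK: `java.net.URI.resolve` combines the parent link with the relative one and normalizes a leading `./`, and `String.endsWith` covers the extension check. The class name `PdfLinks` and its methods are hypothetical, and the example URLs are illustrative. As noted above, a link without a .pdf extension can still lead to a PDF, so this is only a heuristic.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: resolves relative links and checks for a .pdf extension. */
final class PdfLinks {

    private PdfLinks() {}

    /** Resolves an href found in a page against that page's URL. */
    static URI resolve(String pageUrl, String href) {
        // trim() removes leading and trailing whitespace; URI.resolve then
        // normalizes "./" and "../" segments the way a browser would.
        return URI.create(pageUrl).resolve(href.trim());
    }

    /** True if the path part of the link ends in .pdf, ignoring case and any query string. */
    static boolean looksLikePdf(URI link) {
        String path = link.getPath();
        return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) {
        URI link = resolve("http://www.mysite.com/pages/index.html", " ./brochure.pdf");
        // Prints: http://www.mysite.com/pages/brochure.pdf true
        System.out.println(link + " " + looksLikePdf(link));
    }
}
```

Note that `URI.create` throws an `IllegalArgumentException` for characters that are illegal in a URI, such as a space in the middle of the href, so real-world HTML may need additional escaping before this step.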

how to convert a HTML web page into a PDF file using Java

I've been searching the internet for how to convert an HTML page into a PDF file using Java. I found a lot of pointers, but in short, they either don't work or are too difficult to implement. I also downloaded a commercial product, pd4ml; its API is something I'd be happy to work with, except that when I crawled a simple Wikipedia page I got an out-of-memory error (with Xmx set to 1024 MB). Some approaches suggest converting HTML -> XHTML -> FO -> PDF. However, I am getting a lot of exceptions from the XHTML-to-FO XSL file, and judging by the documentation, it's not something I have enough time to understand right now.
Here are my questions/concerns.
1. Is there another cohesive API out there that will easily convert HTML to PDF (commercial or not)?
2. Is there a way I can simply capture an HTML page and store it as a single file? This approach would be similar to Internet Explorer's way of saving a web page as a web archive (single file, MHT format).
Any help is appreciated. (By the way, I know this question has been asked repeatedly, but in addition to the original spirit of the question, I'm open to other ways.) Thanks.
Try wkhtmltopdf, which uses WebKit. Another option (the one I'm currently using) is OpenOffice, remote-controlled via macros.
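Since wkhtmltopdf is a command-line tool, calling it from Java means starting a process. Here is a minimal sketch, assuming the `wkhtmltopdf` binary is installed and on the PATH; the class name `HtmlToPdf` and its methods are hypothetical. The basic invocation takes the input page followed by the output file.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper around the wkhtmltopdf command-line tool. */
final class HtmlToPdf {

    private HtmlToPdf() {}

    /** Builds the command line: wkhtmltopdf <input URL or file> <output PDF>. */
    static List<String> command(String input, String outputPdf) {
        return Arrays.asList("wkhtmltopdf", input, outputPdf);
    }

    /** Runs the conversion and returns the tool's exit code (0 normally means success). */
    static int convert(String input, String outputPdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(input, outputPdf))
                .inheritIO() // forward the tool's progress and error output to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: HtmlToPdf <url-or-html-file> <output.pdf>");
            return;
        }
        System.out.println("exit code: " + convert(args[0], args[1]));
    }
}
```

Running the renderer in a separate process also keeps its memory use outside the JVM heap, which may help with the out-of-memory problem described in the question.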
You may use the open-source iText Java library for that, and read this,
or use the open-source YaHPConverter Java library,
or do this with the help of ICEpdf, a popular open-source library,
or use pd4ml, but it is not free (trial only),
or use this, and this is the manual for it.
My 2 cents using opensource tools:
You can capture a screenshot of the HTML page with Selenium or WebDriver and save it to an image file from your Java code. Once you have the image file, you can convert it to PDF, again from your Java code.
EDIT:
It seems you can do all that in 1 step using itext Html to Pdf
I am not sure, but you could try:
1) Cobra HTML rendering engine http://lobobrowser.org/cobra.jsp
2) HTMLEditorKit -- part of the JDK
3) JWebPane
Use the rendering kit to parse and render the HTML. The rendered output is a Swing component, and iText can use a Swing component to generate the PDF output.
You can try out Pdfcrowd. It is an easy to use commercial online API with many options and with support for Java.
It can create PDF either from web pages or raw HTML code.

Extract text from PDF (google app engine)

Is there any free Java library for extracting text from PDF, that is compatible with Google Application Engine?
I've read about PDFJet, but it can't read PDF, can it?
Is there perhaps other way how to extract text from PDF? I tried http://www.pdfdownload.org/, unfortunately they don't handle non-English characters correctly.
iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.
PdfBox does not run on GAE. It uses Java classes that are not allowed there.
(GAE only permits these http://code.google.com/appengine/docs/java/jrewhitelist.html)
I have partially modified a very old version of PdfBox (0.7.3) to be GAE compliant. Now I'm able to extract text from a PDF (whole page or rectangular area). I only modified a minimal part of the PDF text extraction and not the whole of PdfBox. :)
The idea was to remove references to java.awt.Rectangle etc. by using my own "rectangle" class.
More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html
I modified the latest (1.8.0-Snapshot) version to run on Google AppEngine. Had to disable one Unit-Test, but it runs fine for simple text extraction.
Following the simple try-fail-fix approach, I had to modify 5 files in total. Pretty doable.
You'll also have to explicitly use a RandomAccessBuffer, like Fabrizio explained.
For the extra lazy, here's the compiled jar, the dependencies for text extraction, and the patch. Note that it might not work for every use case (e.g. rectangle-based extraction). I used it to extract the text of a whole page.
https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit
I know there is http://pdfbox.apache.org/index.html
"Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents."
but I've never tested it.
Last month I finished extracting text from PDF files in my project. I used the Xpdf tool to get the text and the text coordinates, but I used it from Xcode (Objective-C). The tool is open source, written in C++, and can be used from many languages. However, I don't know whether Xpdf will work from your Java code. Anyway, you can try this tool.
