Extract text from PDF (google app engine)

Extract text from PDF (google app engine) - java

Is there any free Java library for extracting text from PDF, that is compatible with Google Application Engine?
I've read about PDFJet, but it can't read PDF, can it?
Is there perhaps other way how to extract text from PDF? I tried http://www.pdfdownload.org/, unfortunately they don't handle non-English characters correctly.

iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.

PdfBox does not run on GAE. It uses not-allowed java classes.
(GAE only permits these http://code.google.com/appengine/docs/java/jrewhitelist.html)
I have partially modified a very old version of PdfBox (0.7.3) to be GAE complaiant. Now I'm able to extract text from PDF (whole page or rectangular area). I only modified a minumum part of the pdf text extraction and not the whole PdfBox. :)
The idea was to remove refences to java.awt.retangle & C. using my own "rectangle" class.
More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html

I modified the latest (1.8.0-Snapshot) version to run on Google AppEngine. Had to disable one Unit-Test, but it runs fine for simple text extraction.
Following the simple try-fail-fix approach i had to modify 5 files in total. Pretty doable.
You'll also have to explicitly use a RandomAccessBuffer, like Fabrizio explained.
For the extra lazy, heres the compiled jar, dependencies for text extraction, and the patch. Note that it might not work for every usecase (i.e. rectangle based extraction). Used it to extract text of a whole page.
https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit

I know there is http://pdfbox.apache.org/index.html
Apache PDFBox is an open source Java
PDF library for working with PDF
documents. This project allows
creation of new PDF documents,
manipulation of existing documents and
the ability to extract content from
documents.
but I've never tested it.

Last month, I'd just finished extracting text from pdf file in my project. I used XPDF tool for getting text, and text coordinates, but I used it in Xcode (Objective-C). This tool was open source, written by C++, and able to be encoded in many language. However, I didn't know whether XPdf would be work on your java, or not. Anyway, You can try this tool.

Related

PDF Open Office or MS Word

I am new to java, I have to read a PDF, Open Office or MS Word file and make changes in the file and render as PDF document on my web page. Please someone tell me which of these file's API or SDK is easy to use and also tell me best SDK for this. So I can read, Update and render easily. file also contains Table but there is no image.

We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.

Depending on the types of updates you are doing, modifying PDF is going to be a problem - it's not intended for editing. You might have to find some way of converting the PDF to something first, then edit. Depending on the types of changes you want to make and the documents you are working from even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping then modifying the XML or using the UNO API. Word documents can be editied by using MS Office automation (bad idea), converting to OpenOffice first then editing, or if DOCX, unzipping and processing the XML.
Good luck.

how to convert a HTML web page into a PDF file using Java

i've been searching on the internet on how to convert a HTML page into a PDF file using Java. i found a lot of pointers, and in short, they don't work or are too difficult to implement. i also downloaded a commercial product, pdf4ml; the API is something i'd be happy to work with, except that when i crawled a simple page on wikipedia, i get a out of memory error (setting Xmx to 1024 M). in some approaches, they suggest converting HTML -> XHTML -> FO -> PDF. however, i am getting a lot of exceptions for the XHTML-to-FO XLS file; and reading the documentations, it's not something that i have enough time to understand right now.
here are my questions/concerns.
1. is there another cohesive API out there that will easily convert HTML to PDF (commercial or not)?
2. is there a way i can simply capture a HTML page and store it as a single file. this approach would be similar to using internet explorer's way of saving a web page as a web archive (single file, MHT format)?
any help is appreciated. (btw, i know this question has been asked repeatedly, but in addition to the original spirit of the question, i'm opened to other ways). thanks.

Try wkhtmltopdf, which is using WebKit. Another option (I'm using that currently) is using OpenOffice (remote controlled via macros).

you may use iText open source Java lib for that, and read this
or use YaHPConverter open source Java lib.
or do this whith help of icepdf popular open source lib
or use pd4ml, but it not free, only trial.
or use this, and this is man for it.

My 2 cents using opensource tools:
You can use either Capture screenshots with Selenium or WebDriver to save html page's screenshot in an image file from your Java code. And once you have image file you can convert it to pdf again from your Java code.
EDIT:
It seems you can do all that in 1 step using itext Html to Pdf

I am not sure but you could Try
1) cobra html rendering engine http://lobobrowser.org/cobra.jsp
2) htmleditorkit -- part of jdk
3) JWebPane
Use the rendering kit to parse and render html. The rendered out put is a swing component. Swing component can be used by itext to generate pdf file out put

You can try out Pdfcrowd. It is an easy to use commercial online API with many options and with support for Java.
It can create PDF either from web pages or raw HTML code.

How to read content of scanned pdf file in java / jsp or in javascript

How can i read content of scanned pdf file in java/jsp or in javascript, can you tell how to achieve this with developing code?
advance thanks for reply

You can convert the scanned PDF to a image using GhostScript and then feed it to an OCR engine, such as Tesseract. Take a look at VietOCR for an example implementation.

What you are trying to do (I think) is use OCR to extract text from a image PDF produced by a scanner. Java is probably the best for doing this. There are a number of options for doing this, depending on whether you are prepared to pay for software to do this. Google for Java (or Javascript), PDF and OCR.
IMO, this task is not something that should be done in a JSP. JSPs are best for rendering results ... not for generating them in the first place.

Actually, I am working on the same project at the moment, I am doing this in the following steps and the result works well.
User upload a scanned pdf to PDFUploader servlet, returns a server side file name to front end, which indicates upload is successful.
Front end uses this file name and default page 0 to ask PDFReader servlet to retrieve the first page of pdf file and display is at the front end, you can convert this pdf to a image for use an iframe to have the embedded pdf reader.
Front end uses this file name and default page 0 to ask OCRServlet to perform OCR. I am using WeOCR and tesseract as my OCR engine in an Apache http server. I have modified some parts of the submit.cgi in WeOCR server since I know what types of the format that the WeOCR server will receive. I still have some problems while I convert the scanned pdf to an image (I am using pdfbox )

Google for anything OCR related,
best bet will be to use existing libraries like http://asprise.com/product/ocr/index.php?lang=java

Library/Tool to extract word coordinates from a pdf

I am looking for a (preferably Java-) library or a command line tool to extract word coordinates from pdfs. The input-pdfs contain either text or images with ocr-text in behind.
My Use Case:
In a Java web-application I would like to use this to do hit highlighting and present this without additional software (e.g. Adobe Reader etc.). Instead I want to convert the the matching pages into images and present them within a web page.

You should be able to use http://pdfbox.apache.org/ to do the highlighting and present them as pdf itself. Also look at http://itextpdf.com/.

You can use JPedal to generate the thumbnails (http://www.jpedal.org/pdf_thumbnail_tutorials.php) and extract the text (http://www.jpedal.org/support_egETAW.php)

Extracting the outline (or bookmarks) from PDF files using Java

I'm using PDFBox to extract the outline (bookmarks) information from PDF files, that's even explained in the same site.
However, I've had problems not extracting but generating the qualified urls (foo.pdf#page=22777&zoom=2,2,777) to open the PDF in those bookmarks. Sometimes PDFBox is not able to find the page in which the bookmark is placed (i.e. the page number, left coordinate or top coordinate are wrong.)
Anyone knows a PDF library capable to do this (preferably in Java)? Thanks.
Best regards,
Alexander.

iText (http://itextpdf.com) might work for you.
I've used it mostly to create PDFs (not so much with parsing already exitingones), but the library is good, and does have objects related to outlines and bookmarks.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Extract text from PDF (google app engine) - java

iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.

I know there is http://pdfbox.apache.org/index.html Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. but I've never tested it.

Related

PDF Open Office or MS Word

how to convert a HTML web page into a PDF file using Java

How to read content of scanned pdf file in java / jsp or in javascript

Library/Tool to extract word coordinates from a pdf

Extracting the outline (or bookmarks) from PDF files using Java

Categories

Resources