Apache Tika Output Format - java

I have a requirement where PDF files come in as input; I have to read each one and, based on some rules, split it page by page. The rules are driven by data extracted from the given PDF.
I went through the Apache Tika toolkit, which I believe is built for this kind of requirement. The tool does extract the data, but only in text format. I want the output back in PDF format. I am not sure whether that is possible or not. Please suggest.
Thanks.
Manish.

Related

How to verify text/content in thousands of PDF files

I want to automatically verify/assert that a certain text or sentence is present in each PDF file. I have thousands of PDF files that need to be checked for that specific text or sentence.
You can do this by using Apache Lucene and Apache PDFBox.
Please refer to this post: http://www.programming-free.com/2012/11/simple-word-search-in-pdf-files-using.html
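For what it's worth, the matching step itself can be sketched with the plain JDK once the text is out of the PDF. This is only a rough sketch: it assumes the text has already been extracted (for example with PDFBox), and the class and method names are made up for illustration. Extracted PDF text often has line breaks in the middle of sentences, so the sketch collapses whitespace before comparing.

```java
import java.util.Locale;

class PdfTextCheckSketch {

    // Collapse all whitespace runs (including line breaks) to a single space
    // and lower-case the text, so layout differences do not break the match.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // extractedText is assumed to come from a PDF text extractor;
    // this method only performs the comparison.
    static boolean containsSentence(String extractedText, String expected) {
        return normalize(extractedText).contains(normalize(expected));
    }
}
```

Calling containsSentence(...) once per file and collecting the names of files that return false gives a simple failure report. For thousands of files with many different phrases, an index such as Lucene is probably the better fit.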

Apache FOP XML input to Plain Text Output

I am working on a project, part of which is to produce a PDF file using Apache FOP and also to produce plain text output based on the same data as the PDF. Currently I store the data in an XML DOM, which I pass to Apache FOP to produce the PDF. I would like to use Apache FOP to produce a plain text file from the same XML DOM input as well. However, I am not experienced with Apache FOP and cannot find an example .xslt that produces a plain text file. Does anyone have an example of input .xml data and an .xslt to process against it to produce some kind of plain text file?
I don't believe it's logic I need help with here, but the XML -> XSLT -> plain text output.
Apache FOP seems to be able to produce plain text output; look at the TXT section here: http://xmlgraphics.apache.org/fop/0.95/output.html
Another "simple" solution is to use PDFBox to parse the generated PDF into text.
But if you want to go the hard way and generate text with xsl:output method="text" (never tried), here are some pointers: https://community.oracle.com/thread/94556?start=0&tstart=0 and http://www.w3schools.com/xsl/el_output.asp
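To give an idea of the xsl:output method="text" route, here is a minimal sketch using only the XSLT processor that ships with the JDK (javax.xml.transform). Note that it bypasses FOP entirely. The invoice/item/name/price element names and the class name are invented for illustration, so the stylesheet would need adapting to the real data.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XmlToTextSketch {

    // Hypothetical input data; the element names are made up for this example.
    static final String SAMPLE_XML =
        "<invoice><item><name>Widget</name><price>9.99</price></item>"
      + "<item><name>Gadget</name><price>24.50</price></item></invoice>";

    // Stylesheet that emits one "name: price" line per item as plain text.
    static final String SAMPLE_XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/invoice\">"
      + "<xsl:for-each select=\"item\">"
      + "<xsl:value-of select=\"name\"/>"
      + "<xsl:text>: </xsl:text>"
      + "<xsl:value-of select=\"price\"/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML and returns the text result.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(transform(SAMPLE_XML, SAMPLE_XSLT));
    }
}
```

Running main should print the two lines "Widget: 9.99" and "Gadget: 24.50". Since the data is already held in a DOM, javax.xml.transform.dom.DOMSource can be passed in place of the StreamSource for the input document.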

Converting all document types(MS DOCs+TIFFs+JPGs) stored on Filenet CE to pdf

Currently we are developing a Java application that grabs documents from FileNet (any type can be stored there) and displays them in PDF format. The issue is that I need a tool for converting all MS Office document types (doc, docx, ppt, pptx, ...) to PDF. I've tried Apache POI and iText, but they only convert the Office 2007 formats, and the result is plain text without any formatting or images.
Second, for image documents, I have already converted the images to PDF, but I need the annotations made in the IBM applet image viewer to be printed on the image. How can I get that?
I need an open-source/free solution; any non-free solution (like Snowbound, adptel) will be rejected by the customer.
Any support will be appreciated.
Check out docx4j: http://www.docx4java.org/trac/docx4j
It can handle .docx and .pptx exports, and the documentation gives some info on how to handle older .doc files.

How to read content of scanned pdf file in java / jsp or in javascript

How can I read the content of a scanned PDF file in Java/JSP or in JavaScript? Can you tell me how to achieve this in code?
Thanks in advance for any reply.
You can convert the scanned PDF to an image using Ghostscript and then feed it to an OCR engine, such as Tesseract. Take a look at VietOCR for an example implementation.
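A rough sketch of that Ghostscript-then-Tesseract pipeline, driven from Java with ProcessBuilder, could look like the following. It assumes both tools are installed and on the PATH under the names gs and tesseract (on Windows the Ghostscript binary is usually named differently); the class name, the 300 dpi resolution and the page-%03d.png naming are choices made for this example only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ScannedPdfOcrSketch {

    // Ghostscript command that renders every page of the PDF to a 300 dpi PNG.
    static List<String> rasterizeCommand(Path pdf, Path outDir) {
        return List.of(
            "gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r300",
            "-sOutputFile=" + outDir.resolve("page-%03d.png"),
            pdf.toString());
    }

    // Tesseract command; it writes the recognized text to <outputBase>.txt.
    static List<String> ocrCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    // Runs an external command and fails if it exits with a non-zero status.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("Command failed with exit code " + exit + ": " + command);
        }
    }

    // Rasterizes the PDF, OCRs each page image in order, and joins the text.
    static String ocrPdf(Path pdf, Path workDir) throws IOException, InterruptedException {
        Files.createDirectories(workDir);
        run(rasterizeCommand(pdf, workDir));

        List<Path> pages;
        try (Stream<Path> files = Files.list(workDir)) {
            pages = files
                .filter(f -> f.getFileName().toString().endsWith(".png"))
                .sorted()
                .collect(Collectors.toList());
        }

        StringBuilder text = new StringBuilder();
        for (Path page : pages) {
            String name = page.getFileName().toString();
            Path base = workDir.resolve(name.substring(0, name.length() - 4));
            run(ocrCommand(page, base));
            text.append(Files.readString(Path.of(base + ".txt")));
        }
        return text.toString();
    }
}
```

A call such as ocrPdf(Path.of("scan.pdf"), Path.of("ocr-work")) would then return the recognized text for the whole file. Shelling out keeps the Java side small; VietOCR, mentioned above, is worth a look as an example of an in-process implementation.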
What you are trying to do (I think) is use OCR to extract text from an image PDF produced by a scanner. Java is probably the best choice for this. There are a number of options, depending on whether you are prepared to pay for software. Google for Java (or JavaScript), PDF and OCR.
IMO, this task is not something that should be done in a JSP. JSPs are best for rendering results ... not for generating them in the first place.
Actually, I am working on the same kind of project at the moment. I am doing it in the following steps, and the result works well.
1. The user uploads a scanned PDF to the PDFUploader servlet, which returns a server-side file name to the front end to indicate that the upload was successful.
2. The front end uses this file name and the default page 0 to ask the PDFReader servlet for the first page of the PDF file and displays it. You can either convert the PDF page to an image or use an iframe with an embedded PDF reader.
3. The front end uses this file name and the default page 0 to ask OCRServlet to perform OCR. I am using WeOCR and Tesseract as my OCR engine behind an Apache HTTP server. I have modified some parts of submit.cgi in the WeOCR server, since I know which formats the WeOCR server will receive. I still have some problems when converting the scanned PDF to an image (I am using PDFBox).
Google for anything OCR related;
your best bet will be to use existing libraries like http://asprise.com/product/ocr/index.php?lang=java

Extract text from PDF (google app engine)

Is there any free Java library for extracting text from PDF that is compatible with Google App Engine?
I've read about PDFJet, but it can't read PDFs, can it?
Is there perhaps another way to extract text from PDF? I tried http://www.pdfdownload.org/, but unfortunately they don't handle non-English characters correctly.
iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.
PDFBox does not run on GAE. It uses Java classes that are not allowed.
(GAE only permits these http://code.google.com/appengine/docs/java/jrewhitelist.html)
I have partially modified a very old version of PDFBox (0.7.3) to be GAE compliant. Now I'm able to extract text from PDF (whole page or rectangular area). I only modified a minimal part of the PDF text extraction and not the whole of PDFBox. :)
The idea was to remove references to java.awt.Rectangle and the like, using my own "rectangle" class instead.
More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html
I modified the latest (1.8.0-SNAPSHOT) version to run on Google App Engine. I had to disable one unit test, but it runs fine for simple text extraction.
Following the simple try-fail-fix approach, I had to modify 5 files in total. Pretty doable.
You'll also have to explicitly use a RandomAccessBuffer, like Fabrizio explained.
For the extra lazy, here's the compiled jar, the dependencies for text extraction, and the patch. Note that it might not work for every use case (e.g. rectangle-based extraction). I used it to extract the text of a whole page.
https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit
I know there is http://pdfbox.apache.org/index.html
"Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents."
but I've never tested it.
Last month, I finished extracting text from PDF files in my project. I used the Xpdf tool to get the text and the text coordinates, but I used it from Xcode (Objective-C). The tool is open source, written in C++, and can be used from many languages. However, I don't know whether Xpdf would work from Java or not. Anyway, you can try this tool.
