I currently use iText for PDF generation and I am having difficulty determining if the PDF files generated with iText are text-based or image-based once generated. Is there an easy way to determine that programmatically (or to specify one option or the other at the time of generation)?
iText generates PDF text instructions for text, and PDF image XObjects for images. Some other elements (e.g. borders) are generated as PDF graphics instructions. So I suppose you could say it generates "text-based" files.
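For the "determine that programmatically" part, one common heuristic is to run text extraction on each page and see whether anything comes back. The sketch below covers only the classification step and assumes the per-page text has already been extracted by some parser (in iText 5 that could be PdfTextExtractor.getTextFromPage(reader, pageNumber)); the class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical helper; names are illustrative and not part of iText.
final class PdfContentKind {

    /** True if the text extracted from a page contains any non-whitespace character. */
    static boolean hasText(String extractedPageText) {
        return extractedPageText != null && !extractedPageText.trim().isEmpty();
    }

    /** Counts the pages whose extracted text is non-empty. */
    static int countTextPages(List<String> extractedTextPerPage) {
        int count = 0;
        for (String pageText : extractedTextPerPage) {
            if (hasText(pageText)) {
                count++;
            }
        }
        return count;
    }
}
```

A file where countTextPages returns 0 is most likely made of scanned images only. This is a heuristic: a page can also carry invisible or garbled text, so treat the result as a hint rather than proof.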
With iText 5 (Java), is it possible to save a PDF file as a linearized PDF, sometimes also called a “Web Optimized” or “Fast Web View” enabled PDF?
iText 5 does not have a feature for saving linearized PDFs.
Actually, this would not fit the iText 5 architecture at all: iText 5 attempts to write data to its target output stream as early as possible, and you cannot do that while creating a linearized PDF.
Nonetheless, you can of course create a PdfStamper-like class which takes a PdfReader representing an existing PDF and stores it as a linearized PDF. This will require quite some coding, though.
I'm working on an e-learning project. I have PDF files and I need to validate whether each file uses an RGB or a CMYK color profile.
If the color profile is RGB, the file must be rejected. I have tried a lot but have not found a suitable approach.
Does anyone know how I can do this with iText or another Java PDF library? Please advise.
iText is for producing PDF files programmatically (e.g. converting HTML to PDF, or producing PDF reports). It is only for producing, not for rendering, so you can't use it to check color.
In order to check the color of a pixel in the PDF document, you need to render the page to a BufferedImage or similar, and then read the color of the pixel at a specific (x, y) position.
To render a PDF you could use a library such as ICEpdf or JPedal.
There is a topic on SO about Java PDF renderer libraries: "Java PDF Renderer".
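Once a page has been rendered, reading a pixel needs only the standard library. The following is a minimal sketch that assumes you already have the page as a BufferedImage from whichever renderer you chose; the class and method names are hypothetical. Note that getRGB always reports sRGB values, whatever color space the PDF itself used.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper; names are illustrative and not part of any PDF library.
final class PixelColorCheck {

    /** Returns {red, green, blue} (0-255 each) of the pixel at (x, y). */
    static int[] rgbAt(BufferedImage page, int x, int y) {
        int argb = page.getRGB(x, y); // packed as 0xAARRGGBB in the default sRGB model
        return new int[] { (argb >> 16) & 0xFF, (argb >> 8) & 0xFF, argb & 0xFF };
    }

    /** True when the pixel is not a neutral gray, i.e. its three channels differ. */
    static boolean isColored(BufferedImage page, int x, int y) {
        int[] c = rgbAt(page, x, y);
        return c[0] != c[1] || c[1] != c[2];
    }
}
```

For example, PixelColorCheck.rgbAt(pageImage, 10, 20) gives the rendered color at that position, where pageImage is whatever your renderer returned.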
Can we repack a PDF by reducing the images' size (in bytes, not dimensions) using iText, PDFBox, or another Java PDF library?
Any Java library?
How to make searchable text using any Java library?
Open source or paid.
How to apply OCR to a PDF using PDFBox?
How to make PDF text searchable programmatically using PDFBox?
I searched a lot and didn't find any solution.
Can anyone paste code for OCR with PDFBox?
Try Apache PDFBox.
To extract text, see its text extraction documentation.
Any Java library? How to make searchable text using any Java library? Open source or paid.
You can achieve this using Gnostice XtremeDocumentStudio for Java. For more details, follow the link below.
http://www.gnostice.com/nl_article.asp?id=289&t=How_to_convert_scanned_images_to_searchable_PDF_in_Java
FYI, in the article we demonstrate how to convert a scanned image to a searchable PDF. In fact, the input can be any scanned document (image, PDF or DOCX).
Disclaimer: I work for Gnostice.
You can use PDFBox to extract images from a PDF file, and then use the OCR system of your choice (for example, Tesseract) to obtain the text. Alternatively, if the PDF is mixed text and images, you can use Ghostscript to create an image of each PDF page, and then run OCR.
If you then need a searchable PDF file, build a new PDF by writing the text first, and then drawing the image on top of the text. The text will be searchable, but you will only see the image.
Note that OCR engines like Tesseract and Google Vision will return positional information for each word, so you will be able to place the text in the correct position.
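Placing the text correctly mostly comes down to a coordinate conversion. The sketch below assumes the OCR engine reports each word box in image pixels with the origin at the top-left corner, that the image covers the whole unrotated page, and that the page's lower-left corner is at (0, 0); PDF user space has its origin at the bottom-left and uses 72 units per inch by default. The class and method names are hypothetical.

```java
// Hypothetical helper; names are illustrative and not part of any OCR or PDF library.
final class OcrBoxToPdf {

    /**
     * Converts a word box given in image pixels (origin top-left, y grows downward)
     * to PDF user space (origin bottom-left, y grows upward, 72 units per inch).
     * Returns {x, y, width, height}, where (x, y) is the lower-left corner of the box.
     */
    static double[] toPdfRect(int left, int top, int right, int bottom,
                              int imageHeightPx, double dpi) {
        double scale = 72.0 / dpi;                   // pixels -> PDF units
        double x = left * scale;
        double y = (imageHeightPx - bottom) * scale; // flip the vertical axis
        double width = (right - left) * scale;
        double height = (bottom - top) * scale;
        return new double[] { x, y, width, height };
    }
}
```

For a page scanned at 300 dpi and 3300 pixels tall, a word box from (300, 300) to (600, 450) lands at about x = 72, y = 684 with a size of 72 by 36 units. The returned height is also a reasonable starting point for the font size of the hidden text.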
I'm developing a web application using Flex and JSP.
I am having some performance issues with displaying multiple PDF files.
I am trying to display about 50-100 PDF files. I know that is a little crazy.
Hence, I set the project up to convert the PDF files to JPG format and display the JPG files instead.
I'm wondering if there is a way to reduce the size of a PDF file to that of a JPG.
Additionally, I would like to find other ways to improve the performance.
Does anyone know a good way to display many PDF files (that will be mostly just text) in a web application? Or should I just have it display JPG files?
If the PDF files are mostly text you should probably use HTML. Is there something that would prevent you from making regular pages from your PDFs?
You can convert the PDF to an RTF text file and use the text from the RTF file to populate your HTML page, perhaps in a table.
Check out the Ghostscript library for doing this conversion.