I have a requirement to clip a rectangular region of an OCRed PDF into an image. (The PDF was originally a scan, so we had to perform OCR on it.)
I was not able to find any library that can do this directly, so I have split the task into two steps:
1. Clip the rectangular part from the PDF using iText. The result is a PDF.
2. Convert the clipped PDF into an image using PDFBox.
However, when converting the clipped PDF into an image with PDFBox, the result is not as expected. For example, if the clipped PDF contains only a checkbox, the checkbox does not appear in the JPEG image.
I have searched Stack Overflow for every possible solution, but with no success.
My code is the same as the solution provided by Tilman Hausherr here. I have also tried this.
Is there a direct way to do the above two steps in one, or a better way to convert a PDF to an image?
Please don't mark this as a duplicate; I have not been able to find a solution even after a lot of searching.
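One possible way to do both steps in one, sketched here under assumptions: render the whole page to a BufferedImage first (for example with PDFBox's PDFRenderer, not shown) and crop the region from that image instead of clipping the PDF. The class and method names below are hypothetical, and the sketch assumes an unrotated page whose lower-left corner is at (0, 0). It only covers the coordinate conversion and the crop; it does not explain why the checkbox goes missing in the original approach.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper: crops a PDF-space rectangle out of an already rendered page image.
final class PdfRegionCropper {

    /**
     * Converts a rectangle given in PDF user space (points, origin at the
     * bottom-left of the page) into pixel coordinates of an image rendered
     * at the given DPI (origin at the top-left).
     */
    static Rectangle toPixelRect(float llx, float lly, float width, float height,
                                 float pageHeightPt, float dpi) {
        float scale = dpi / 72f;                                    // 1 pt = 1/72 inch
        int x = Math.round(llx * scale);
        int y = Math.round((pageHeightPt - lly - height) * scale);  // flip the y axis
        int w = Math.round(width * scale);
        int h = Math.round(height * scale);
        return new Rectangle(x, y, w, h);
    }

    /** Crops the region from the rendered page, clamped to the image bounds. */
    static BufferedImage crop(BufferedImage page, Rectangle region) {
        Rectangle bounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        Rectangle clipped = region.intersection(bounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Region lies outside the page image");
        }
        return page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }
}
```

The cropped image can then be written with ImageIO. Writing it as PNG rather than JPEG avoids lossy compression on small regions such as a single checkbox.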
Related
I have to print hundreds of single-page PDF files, and I wonder whether I can detect if a PDF is black and white or color. I want to send black-and-white PDFs to one print queue and color ones to another.
I'm processing these PDFs in Java. Can someone suggest a technique?
I think you could give PDFBox a try.
In this example they extract an .icc (color profile), so it may be what you are looking for.
How do you extract color profiles from a PDF file using pdfbox (or other open source Java lib)
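A different technique from the color-profile extraction mentioned above, offered only as a rough sketch: render each page to a BufferedImage (the rendering step, for example with PDFBox, is assumed and not shown) and check whether any pixel has red, green and blue values that differ. The class name, method name and tolerance parameter are hypothetical.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: decides whether a rendered page contains any non-gray pixel.
final class ColorPageDetector {

    /**
     * Returns true if any pixel's red, green and blue channels differ by more
     * than the tolerance, i.e. the image is not pure grayscale.
     */
    static boolean hasColor(BufferedImage img, int tolerance) {
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int max = Math.max(r, Math.max(g, b));
                int min = Math.min(r, Math.min(g, b));
                if (max - min > tolerance) {
                    return true;   // found a pixel that is visibly tinted
                }
            }
        }
        return false;
    }
}
```

A small tolerance above zero may help with scanned pages, where gray pixels are rarely perfectly neutral. A low rendering resolution keeps the scan cheap when hundreds of files have to be sorted.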
I know there are many suggestions on how to resolve problems with editing existing PDFs, but among all of those I couldn't find a solution for my problem.
I need to add information about file acceptance ("Document accepted by Tom Smith, 2020-01-01", possibly multiple acceptances) to the last page of the PDF. I need to determine whether the page is full or whether there is enough space for my text.
I wanted to find the y position of the last element on the last page of the PDF and check it against the page size. If the page is full, I'm going to add a new page and then add the new text there.
I have no idea how to resolve this. I tried looking for answers with iText and PDFBox, but found no satisfactory solutions.
Raster Image based approach:
Render the last page of the PDF to a bitmap image with any library you are comfortable with (Ghostscript?). 72 dpi should be enough for your purpose.
Then you can use any image processing library, such as OpenCV, and check rectangular areas from the bottom up for non-white pixels. OpenCV is very fast with the CountNonZero() function.
You can also find any large white zone anywhere in the image, not just at the bottom. This link could be your starting point.
https://answers.opencv.org/question/72939/how-to-find-biggest-white-zone-in-an-scanned-image/
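The bottom-up scan described above can also be sketched in plain Java without OpenCV. This is a minimal illustration, assuming the last page has already been rendered to a BufferedImage by some other library; the class name, method names and white threshold are hypothetical.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: measures the blank space at the bottom of a rendered page.
final class LastContentFinder {

    /**
     * Scans rows from the bottom up and returns the index of the lowest row
     * that contains a pixel darker than the threshold, or -1 if the page is blank.
     */
    static int lowestContentRow(BufferedImage page, int whiteThreshold) {
        for (int y = page.getHeight() - 1; y >= 0; y--) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < whiteThreshold || g < whiteThreshold || b < whiteThreshold) {
                    return y;
                }
            }
        }
        return -1;
    }

    /** Free space below the last content, in PDF points, for an image rendered at the given DPI. */
    static float freeSpacePt(BufferedImage page, int whiteThreshold, float dpi) {
        int row = lowestContentRow(page, whiteThreshold);
        int freeRows = page.getHeight() - 1 - row;   // row == -1 yields the full height
        return freeRows * 72f / dpi;
    }
}
```

Comparing the returned free space with the height the acceptance line needs (plus the bottom margin) tells you whether to write on the last page or add a new one. Note that a page footer or page number would count as content in this scan.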
I have two PDF files, one containing text (also available in RTF format) and one containing a background image. Now, I'd like to merge those two pages into one by putting the text pdf above the background pdf.
One solution I came across was to use this approach to convert the pdf files to images and draw them onto a new page. However, I wonder if this is the best solution since the PDF won't contain any text afterwards that you could copy&paste but just images.
Is there a solution to merge two pages without converting them to images first?
Similar to this SO question: PDFBox - PDF to Image losing barcode
The PDF in question: https://drive.google.com/file/d/0B13zTPQR9uxscXRMWjhsZ0doa00/view?usp=sharing
There is minimal text and a medium-sized QR code. I have tried many different solutions to convert this PDF page to an image using PDFBox/ImageIO, but so far the QR code is always missing from the result.
When I use PDFBox's PDFImageWriter I get this log:
ColorSpace Pattern doesn't provide a non-stroking color, using white instead!
I'm thinking that pertains to the QR Code.
Is this expected behavior? Can someone else confirm PDFBox cannot copy the QR Code from this PDF? Is there any way to convert this to an image using Java or PDFBox?
I'm developing a Java application that extracts text from a PDF file, and I use the PDFBox library for this. I can extract most of the text from the file, but I cannot extract certain words that have a larger size and a different color.
I have tried the setAverageCharTolerance method of the PDFTextStripper class, but it made no difference.
Does anyone know of a way to extract all the text from the PDF file, regardless of its size or color?