I want to use PDFBox to extract text from Persian PDF files, but it returns "?" for all the Persian characters (Latin words in the same document are returned correctly).
How can I fix it? Any advice?
Sadly, the provided file has the Persian text as vector graphics, not as text drawn with fonts, so it cannot be extracted. You'll have to use OCR for it.
See also the text extraction FAQ:
How come I am not getting any text from the PDF document?
Text extraction from a pdf document is a complicated task and there
are many factors involved that affect the possibility and accuracy of
text extraction. It would be helpful to the PDFBox team if you could
try a couple things.
Open the PDF in Acrobat and try to extract text from there. If Acrobat
can extract text then PDFBox should be able to as well and it is a bug
if it cannot. If Acrobat cannot extract text then PDFBox ‘probably’
cannot either.
It might really be an image instead of text. Some PDF documents are
just images that have been scanned in. You can tell by using the
selection tool in Acrobat, if you can’t select any text then it is
probably an image.
Related
I want to blur sensitive information in a PDF file. I read about pyPdf in Python and PDFBox in Java, but I could not work out how to search and replace text in a PDF file. By replacing I mean blurring it or even substituting asterisk characters.
I also thought of a step in which I take an image of every page of the PDF and then show the images in HTML one by one. But then the same problem remains: how do I replace text in those images?
I'm writing to inquire about a problem I have when using the iText library to extract text content from a PDF file. I was able to extract all the text, but couldn't find a method to extract font styles.
First you need to read the answer to this question: how can i get text formatting with iTextSharp
In this question, you'll discover that the TextRenderInfo has a getFont() method that allows you to get the PostScriptFontName. If you are in luck, this PostScriptFontName will give you information about the style.
Note that this won't always work. Please read the answer to this question: What are the ways of checking if piece of text in PDF document is bold using iTextSharp
That question shows an example of a font that doesn't reveal anything about its style.
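As a rough illustration of the "if you are in luck" case, the heuristic below guesses a style from a PostScript font name. It is only a sketch: the class and method names are hypothetical, the keyword list is an assumption, and it returns "unknown" for names that reveal nothing about the style. It uses plain Java strings, so the name would first have to be obtained from the font returned by getFont().

```java
import java.util.Locale;

/** Heuristic sketch (hypothetical helper): guess a style from a PostScript font name. */
final class FontStyleGuess {

    /** Removes a subset tag such as "ABCDEF+" (six uppercase letters and a plus sign). */
    static String stripSubsetPrefix(String postScriptName) {
        if (postScriptName.length() > 7 && postScriptName.charAt(6) == '+') {
            boolean allUpper = true;
            for (int i = 0; i < 6; i++) {
                char c = postScriptName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    allUpper = false;
                    break;
                }
            }
            if (allUpper) {
                return postScriptName.substring(7);
            }
        }
        return postScriptName;
    }

    /** Returns "bold", "italic", "bold italic", or "unknown" when the name gives no hint. */
    static String guessStyle(String postScriptName) {
        String name = stripSubsetPrefix(postScriptName).toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        if (bold && italic) {
            return "bold italic";
        }
        if (bold) {
            return "bold";
        }
        if (italic) {
            return "italic";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guessStyle("ABCDEF+Arial-BoldMT"));
        System.out.println(guessStyle("Helvetica-Oblique"));
        System.out.println(guessStyle("F1"));
    }
}
```

A font named something like F1 falls through to "unknown", which is the situation the second linked question describes.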
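To get the font size of a piece of text in iText, use this code:
float fontSize = renderInfo.getAscentLine().getStartPoint().get(1)
        - renderInfo.getDescentLine().getStartPoint().get(1);
This gives the distance between the ascent and descent lines of the text, which you can use as its font size.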
I have a library which generates PDF documents with images.
I want to be able to add text after each image. What is the syntax for that? How to insert text into pdf documents?
I have to use the library I have, not another one.
First of all, mkl is correct: have a look at the specification for all of the details. PDF is an exact language, and mistakes will routinely be punished severely once you open the PDF in viewers.
Secondly, when you think about putting text on the page, don't forget that besides the text operators that draw the text, you'll also have to specify the font used to draw it. That includes making sure a font resource is included in the PDF file if your library doesn't automatically handle all of that for you.
If you want to cut corners (I shiver while writing this) and perhaps not read the specification as thoroughly, try this.
1) Create a PDF file that looks more or less like what you want.
2) Use a tool such as pdfToolbox from callas (http://www.callassoftware.com/callas/doku.php/en:download) or Browser from Enfocus (http://www.enfocus.com/en/products/browser). Both of these tools allow you to investigate the low-level structure of a PDF file, including looking at the actual page description code. This will show you how fonts are embedded (if you have to do it yourself that could be very handy) and how text is rendered on the page (and how you set the font, size, color etc... to use).
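To give an idea of what such page description code looks like, here is a minimal sketch that builds the operators for one line of text as a string. The class and method names are hypothetical, F1 is an assumed resource name, and the sketch assumes plain ASCII text and a standard Type 1 font (Helvetica) that does not need embedding. How the string gets into the page content depends entirely on the library you are using.

```java
/** Sketch (hypothetical helper): build PDF text operators for one line of text. */
final class TextOperatorsSketch {

    /** Escapes backslashes and parentheses, which are special inside a PDF literal string. */
    static String escapeLiteral(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\\' || c == '(' || c == ')') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds the operators that draw one line of text at (x, y) in page coordinates. */
    static String showText(String fontResourceName, int fontSize, int x, int y, String text) {
        return "BT\n"                                                // begin text object
             + "/" + fontResourceName + " " + fontSize + " Tf\n"     // select font and size
             + x + " " + y + " Td\n"                                 // move to the start position
             + "(" + escapeLiteral(text) + ") Tj\n"                  // show the string
             + "ET\n";                                               // end text object
    }

    /** Font dictionary for Helvetica, one of the standard Type 1 fonts. */
    static String helveticaFontDictionary() {
        return "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>";
    }

    public static void main(String[] args) {
        System.out.print(showText("F1", 12, 72, 700, "Figure (1)"));
        System.out.println(helveticaFontDictionary());
    }
}
```

The name F1 only works if the page's resource dictionary maps it to the font object, for example /Resources << /Font << /F1 5 0 R >> >>, where the object number 5 is just a placeholder.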
I have a program. It outputs to PDF, but that is close to impossible to read back in. So I need an additional file attached to my PDF in order to make it editable in my program. Attaching a file to a PDF is a good idea, but the attachment is visible to the user, which I don't want.
An alternative is to hide my readable file format inside an image which would be added to the PDF somewhere to the top of the first page, before everything else... Even to metadata if that's possible...
That way I can extract image from pdf using a PDF library (iText), and read from it.
My question is how to add an image to a PDF so that it is as well hidden as possible (visually and in terms of accessibility). It also has to be in a place that would be the same for any created document (somewhere at the top, or at the very bottom of the document, or in a part of the document that isn't displayed at all... I'm really guessing here, I'm not really familiar with the PDF file format)...
Any ideas?
P.S. It's not really important which image it is; it could be, e.g., a completely transparent image of 1x1 pixels.
I'm not sure what you mean by Image, but you can "extend" the PDF reference.
A PDF consists of objects: PDF numbers, PDF names, PDF strings, PDF arrays, PDF dictionaries, PDF streams. What you probably want, is to add an entry to a dictionary (pick one: the root dictionary, the info dictionary, the root of the page tree,...) that isn't defined in the PDF reference, so that it isn't rendered in a PDF viewer.
The key of such an entry must be a PDF name. To avoid clashes with existing names (names that are part of a current PDF spec, or will be part of a future spec), it is advised to register a four-letter prefix with ISO. For instance, Adobe registered adbe, and iText registered ITXT and uses that prefix followed by an underscore. ITXT_OriginalData would be a good name if we needed the functionality you describe.
The value of such an entry will be a PDF stream. In iText, you need the PdfStream class for this.
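To make the structure concrete, the sketch below shows what such a private entry and its stream look like once serialized. It is an illustration in plain Java rather than iText code: the class and method names are hypothetical, XMPL is a made-up prefix (not a registered one), and the object number 12 is arbitrary. A real file also needs a matching cross-reference entry, which a library normally takes care of.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Sketch (hypothetical helper): serialized form of a private stream and the entry pointing at it. */
final class PrivateStreamSketch {

    /** Serializes an uncompressed indirect stream object holding the payload bytes. */
    static byte[] streamObject(int objectNumber, byte[] payload) {
        byte[] header = (objectNumber + " 0 obj\n<< /Length " + payload.length + " >>\nstream\n")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] footer = "\nendstream\nendobj\n".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header, 0, header.length);
        out.write(payload, 0, payload.length);
        out.write(footer, 0, footer.length);
        return out.toByteArray();
    }

    /** Builds a dictionary entry such as "/XMPL_OriginalData 12 0 R". */
    static String privateEntry(String prefix, String name, int objectNumber) {
        return "/" + prefix + "_" + name + " " + objectNumber + " 0 R";
    }

    public static void main(String[] args) {
        byte[] payload = "my editable source data".getBytes(StandardCharsets.US_ASCII);
        System.out.print(new String(streamObject(12, payload), StandardCharsets.US_ASCII));
        // The entry below would be added to, e.g., the root (catalog) dictionary.
        System.out.println(privateEntry("XMPL", "OriginalData", 12));
    }
}
```

Because viewers ignore keys they do not know, the stream is never rendered, while your own program can look up the key and read the bytes back.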
I have some PDF Files with similar layout.
For example, their introduction parts have same font color and size.
I want to extract the introduction parts from these PDF files using this text property information, but I couldn't find any method for it.
For example, I would pass a parameter like #333333, and it would return only the text in the PDF that has the color #333333. Is that possible?
I use the iText library.
Thanks..