I have a library which generates PDF documents with images.
I want to be able to add text after each image. What is the syntax for that? How do I insert text into PDF documents?
I have to use the library I have, not another one.
First of all, mkl is correct: have a look at the specification for all of the details. PDF is an exact language; if you make mistakes, they will routinely be punished severely once you open the PDF in a viewer.
Secondly, when you think about putting text on the page, don't forget that besides the text operators that draw the text on the page, you also have to specify the font used to draw it. That includes making sure a font resource is included in the PDF file if your library doesn't handle that for you automatically.
If you want to cut corners (I shiver while writing this) and perhaps not read the specification quite as thoroughly, try this.
1) Create a PDF file that looks more or less like what you want.
2) Use a tool such as pdfToolbox from callas (http://www.callassoftware.com/callas/doku.php/en:download) or Browser from Enfocus (http://www.enfocus.com/en/products/browser). Both of these tools allow you to investigate the low-level structure of a PDF file, including looking at the actual page description code. This will show you how fonts are embedded (if you have to do it yourself that could be very handy) and how text is rendered on the page (and how you set the font, size, color etc... to use).
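To make the text operators and the font selection mentioned in this answer a bit more concrete, here is a minimal Java sketch that only builds the text-showing part of a page content stream as a string. It uses no PDF library. The class name TextOps, the resource name /F1 and the coordinates are assumptions for illustration, and the page's resource dictionary still has to map /F1 to a real font object, which this sketch does not do.

```java
final class TextOps {
    /** Escapes the characters that are special inside a PDF literal string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '(' || c == ')') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a text object: select font and size, move to (x, y), show the text. */
    static String showText(String fontRes, int size, int x, int y, String text) {
        return "BT\n"
             + "/" + fontRes + " " + size + " Tf\n"   // font resource name and size
             + x + " " + y + " Td\n"                  // position in page units
             + "(" + escape(text) + ") Tj\n"          // show the string
             + "ET\n";
    }

    public static void main(String[] args) {
        // Hypothetical caption placed below an image
        System.out.print(showText("F1", 12, 72, 700, "Figure 1 (draft)"));
    }
}
```

How the characters of the string map to glyphs depends on the encoding of the font behind /F1, so plain ASCII captions are the safest starting point.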
I'm using iText (Java lib) to process an already created PDF file.
What I would like to achieve is to replace fonts that are metric-compatible with a PDF base font with that PDF base font. This would make the PDF more "compliant" and potentially also smaller.
Here's how it would go:
Loop through the fonts used in the PDF.
If a font is metric-compatible with a PDF base font, then replace the font name with that font (but maintain the PDF resource name, e.g. /F13, so that we do not need to touch any text objects). Since iText embeds in its jar the AFM files for the PDF base fonts, I'm assuming that iText actually has enough knowledge to make this assessment. I would probably have to look at the serif/sans-serif and monospace flags as well to know if I should swap in Helvetica, Times or Courier.
Further, if metric-compatible: remove any font embeds for that font (since we've replaced it with a PDF base font, there's no need to embed anything; size matters!).
An example:
An existing PDF file uses "Calibri", "Arial" and "Times". Here's how each of those should be handled.
Calibri. This font doesn't have a metric-compatible cousin among the PDF base fonts so processing for this font resource will be skipped.
Arial. This font has a metric-compatible cousin among the PDF base fonts, namely "Helvetica". The name of the font resource (attribute BaseFont I suppose) will be changed to "Helvetica" and any potential embeds will be removed.
Times. This font is already a PDF base font. Skip processing. (we may consider unembedding here if there's something to unembed, but I already know how to do that so not part of the question)
I basically get stuck on the step which is to determine metric-compatibility. Any help is greatly appreciated.
(Note: An answer based on iText 5.x is perfectly ok as I feel the recent iText 7 is still somewhat undocumented)
UPDATE
As pointed out, a number of additional checks would need to be carried out in order to do a safe replacement:
Font encoding compatibility. Not really a problem for me, as fonts in the documents I'll be processing will be using WinAnsiEncoding.
Available chars in font. Not really a problem for me, as I'll only be processing documents that use only ISO 8859-1 chars. Furthermore: if the PDF contains an embedded subset of a font, then I'll have easily accessible knowledge about exactly which chars are used in the document for that font.
I'm sure I can figure out how to check for both these conditions. (I'm blissfully naive)
I'm not trying to build a general tool. I know where the PDFs I'll be processing come from. In any case, I guess it is possible to get enough information from the PDF to skip the font substitution if it can't be determined that the substitution will be "safe".
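One possible way to approach the metric-compatibility step is to compare advance widths character by character: for a given document, two fonts can be treated as metric-compatible if every character code the document uses has the same width in both. The sketch below is plain Java and assumes the widths have already been read from somewhere, for example from the PDF font's /Widths array and from the base font's AFM file. The class name, the tolerance parameter and the sample numbers are hypothetical, and this is not how iText itself decides anything.

```java
import java.util.Map;
import java.util.Set;

final class MetricCheck {
    /**
     * Returns true if every used character code has the same advance width
     * (within the given tolerance) in both fonts.
     * Widths are expressed in 1/1000 of a text-space unit.
     */
    static boolean metricCompatible(Map<Integer, Integer> embedded,
                                    Map<Integer, Integer> baseFont,
                                    Set<Integer> usedCodes,
                                    int tolerance) {
        for (int code : usedCodes) {
            Integer a = embedded.get(code);
            Integer b = baseFont.get(code);
            if (a == null || b == null) {
                return false; // width unknown: substitution cannot be shown to be safe
            }
            if (Math.abs(a - b) > tolerance) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Illustrative widths only, keyed by WinAnsi code (32 = space, 65 = 'A', 97 = 'a')
        Map<Integer, Integer> base = Map.of(32, 278, 65, 667, 97, 556);
        Map<Integer, Integer> same = Map.of(32, 278, 65, 667, 97, 556);
        Map<Integer, Integer> other = Map.of(32, 226, 65, 579, 97, 479);
        Set<Integer> used = Set.of(32, 65, 97);
        System.out.println(metricCompatible(same, base, used, 0));
        System.out.println(metricCompatible(other, base, used, 0));
    }
}
```

Restricting the comparison to the codes actually used fits the subset case described above: if the embedded font is a subset, only the characters present in the document need to match.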
I was going through the iText API docs and was able to create a PDF with a watermark image or text, but I did not find a method to get/extract watermark content from a PDF.
So I have a PDF document containing a watermarked text/image and I want to extract that text or image and validate it, which I am not able to do.
How to extract watermark content using iText APIs? Or is there any other way to validate watermark content?
By validate I mean: if I have an existing PDF/image with some watermarked text [as done in the 2nd link in the references below], I want to check whether it has the expected text/image.
References:
http://itextpdf.com/themes/keyword.php?id=226
http://www.java-connect.com/itext/add-watermark-in-PDF-document-using-java-iText-library.html
Extracting watermark content?
There is nothing special about watermarks in PDFs in contrast to regular page content. They merely either appear pretty early in the content stream, so that other content later in the stream is drawn above them, or appear pretty late in the content stream but have some kind of transparency applied.
Actually there is another type of watermark which is special, the so-called Watermark Annotations. As these annotations can easily be lost when documents are merged or otherwise manipulated, though, they are hardly ever used.
Furthermore, the different PDF generating software suites that offer a way to add watermarks each do so in their own way. Thus, you cannot even recognize watermarks by some special operations done in some specific unique pattern.
Even the iText examples you referred to apply different kinds of watermarks:
MovieCountries2 simply draws some large gray text using an angled base line.
StampStationery copies a complete page from some PDF (which itself may visually have foreground and background material) into a separate object inside the target PDF and adds a reference to this object at the beginning of every page of the target.
InsertPages similarly references a page from some PDF on every newly generated target document page.
Thus, blind watermark extraction is virtually impossible.
Validating watermark content!
You might try some validation, though, if you know what you are searching for. You just do not search a dedicated watermark stream (no such thing exists in PDF) but instead the whole page content.
iText offers the classes of the parser package which allow extraction of text and/or bitmap images from content streams. Look at the samples referenced from the keywords PARSING PDF > EXTRACTING IMAGES and PARSING PDF > EXTRACTING TEXT.
You merely have to check whether the image or text which you expect can be found by these classes positioned and styled as you expect.
Can I create an expandable list in PDF files? The expandable list will be of the form:
+Item1
+Item2
-Item3
    -Subitem3.1
    -Subitem3.2
+Item4
-Item5
    -Subitem5.1
    -Subitem5.2
    -Subitem5.3
Also, I need to create the PDF file from Java (I was thinking of using iText; is another library better/easier?). Is this possible? Or is a report in some other standard format (not PDF or HTML) an easier way out?
First this: I'm the creator of iText, so forgive me for not pointing you to other solutions ;-)
Now for your question: you're asking for dynamic functionality (a tree structure that opens/closes upon user interaction) inside a PDF document.
The most obvious answer is: this isn't possible. When creating PDF, think of paper. Can you print a tree structure on paper that opens/closes when the end user touches the paper? No, you can't, therefore you're asking something that isn't possible in PDF.
The less obvious answer is: it depends. What type of PDF are we talking about?
If you're talking about an interactive XFA form, then you may be able to achieve what you want. The XML Forms Architecture (XFA) is an XML specification that can be used to define interactive forms. When you use XFA, the PDF is nothing more than a container for XML. This XML is rendered dynamically inside Adobe Reader. How to create an XFA form? I only know about two products: Adobe LiveCycle Designer and Avoka Smart Forms Designer.
If you're talking about 'regular PDF', then one option is to embed a swf file. In this case, the tree structure will be rendered by Flash player (which could be a disadvantage, because this might not work with all PDF viewers). Another disadvantage: the tree structure will be confined to a fixed rectangle on a fixed page.
Finally: you can create such a structure in the bookmarks panel. In PDF terminology, those bookmarks are called Outlines. Obviously, the tree structure won't be a part of the printable content. It will be visible in a separate panel in your PDF viewer.
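For the bookmarks option, the open/closed state of each node is stored with the outline item itself: as far as I recall the PDF specification, an item's /Count entry is positive when the item is shown expanded and negative when it is shown collapsed, and its absolute value is the number of descendants visible once the item is expanded. The plain-Java sketch below only models that bookkeeping for a tree like the one in the question. It does not use iText, and the class name OutlineItem and the child of Item1 are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlineItem {
    final String title;
    final boolean open;
    final List<OutlineItem> kids = new ArrayList<>();

    OutlineItem(String title, boolean open) {
        this.title = title;
        this.open = open;
    }

    OutlineItem add(OutlineItem kid) {
        kids.add(kid);
        return this;
    }

    /** Number of descendants that are visible once this item itself is expanded. */
    int visibleDescendants() {
        int n = 0;
        for (OutlineItem k : kids) {
            // A child is always visible; its own descendants only if it is open
            n += 1 + (k.open ? k.visibleDescendants() : 0);
        }
        return n;
    }

    /** Value for /Count: positive if open, negative if closed, 0 means "no entry needed". */
    int count() {
        int n = visibleDescendants();
        return open ? n : -n;
    }

    public static void main(String[] args) {
        OutlineItem item3 = new OutlineItem("Item3", true)      // "-Item3": expanded
            .add(new OutlineItem("Subitem3.1", true))
            .add(new OutlineItem("Subitem3.2", true));
        OutlineItem item1 = new OutlineItem("Item1", false)     // "+Item1": collapsed
            .add(new OutlineItem("Subitem1.1", true));          // hypothetical hidden child
        System.out.println(item3.title + " /Count " + item3.count());
        System.out.println(item1.title + " /Count " + item1.count());
    }
}
```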
I have been playing around with PDFBox and its PDFTextStripperByArea class.
I was able to extract whether the text is bold or italic, but I'm unable to get the underline information.
As far as I understand it, in PDF an underline is done by drawing lines. So in theory I should be able to get some sort of information about lines somewhere around the text. Given this information, I could then find out whether the text is underlined or in a table.
Here is my code so far:
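// Characters of one article, as collected by the text stripper
List<TextPosition> textPos = charactersByArticle.get(index);
for (TextPosition t : textPos)
{
    // Not every font carries a descriptor, so guard against null
    if (t.getFont().getFontDescriptor() != null)
    {
        // Bold: heavy font weight or the ForceBold flag
        if (t.getFont().getFontDescriptor().getFontWeight() > BOLD_WEIGHT ||
            t.getFont().getFontDescriptor().isForceBold())
        {
            isBold = true;
        }
        // Italic: the Italic flag of the font descriptor
        if (t.getFont().getFontDescriptor().isItalic())
        {
            isItalic = true;
        }
    }
}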
I have tried to play around with the PDGraphicsState object, which is processed in the processEncodedText method of the PDFStreamEngine class, but found no information about lines there.
Any suggestions where this information could be retrieved from?
Here is what I have found out so far:
PDFBox uses a resource file to bind PDF operators/instructions to certain classes which then process the information.
If we take a look at the PDFTextStripper.properties resource file under:
pdfbox\src\main\resources\org\apache\pdfbox\resources\
we can see that for instance the BT operator is bound to the
org.apache.pdfbox.util.operator.BeginText class and so on.
The PDFTextStripper under
pdfbox\src\main\java\org\apache\pdfbox\util\
takes this into account and uses these classes to process the PDF.
BUT all graphical objects are ignored, therefore there is no information about underlines or table structure!
Now if we take a look at the PageDrawer.properties resource file, we can see that it binds almost all available operators. It is used by the PageDrawer class under
pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\
The "trick" is now to find out which graphical operators represent underlines and tables, and to use them in combination with PDFTextStripper.
Now this would mean reading the PDF file specification, which is currently way too much work.
If someone knows which operators are used to draw underlines and table lines, please let me know.
As you mention, PDFBox uses resource files to bind PDF operators/instructions to visitors which process the information.
You'd probably best start by copying PDFBox's existing visitor into your own source-folder, and then adding/ extending the implementation from there.
My long-ago PostScript experience recalls 'moveto' and 'lineto' operators. Since PDF is roughly PS-based, you'll be looking for something similar.
http://learnpostscript.wordpress.com/category/lineto/
PDF format is a b*tch: it's HTML, done wrong. It represents graphical implementation, not semantics. Even reconstructing sentences is difficult: words or even individual characters are positioned, and the 'space' or 'newline' must be algorithmically reconstructed. In short, Adobe are a*holes. And Reader is a non-ergonomic, bug-riddled, insecure, bloated pig.
However, you can accomplish your requirement if you are willing to put, say, 12+ hours of work in. As well as detecting by position, underlines will typically be emitted in the PDF immediately after their text, so you can latch your detection by PDF document order, not just page position.
Also, try constructing a trivial two-line PDF with underlined text. Then see what you can make of it, parsing it back in! The underline should stick out like dog's bananas, and once you can detect that, you'll be well on the way.
PDFBox is not very good for extensibility, it's mainly just a big pile of algorithms. For this reason, just copy the PDFTextStripper source (and maybe have PageDrawer for reference) and prototype from there.
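Once the line segments and the text positions have both been collected, the positional test itself can be quite small. The following plain-Java sketch shows one possible heuristic: a segment counts as an underline if it is horizontal, covers most of the text run horizontally, and sits slightly below the baseline. The class and field names are hypothetical, the thresholds (0.5 units, 80 %, a quarter of the font size) are guesses that would need tuning against real files, and unrotated text in the default PDF coordinate system (y axis pointing up) is assumed. It does not call PDFBox at all.

```java
final class UnderlineHeuristic {
    /** A run of text: left edge, baseline, width and font size in page units. */
    static final class TextRun {
        final double x, baseline, width, fontSize;
        TextRun(double x, double baseline, double width, double fontSize) {
            this.x = x; this.baseline = baseline; this.width = width; this.fontSize = fontSize;
        }
    }

    /** A stroked line segment from (x1, y1) to (x2, y2). */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    static boolean underlines(Segment s, TextRun t) {
        // 1. The segment must be (nearly) horizontal
        if (Math.abs(s.y1 - s.y2) >= 0.5) {
            return false;
        }
        // 2. It must cover most of the text run horizontally
        double left = Math.min(s.x1, s.x2);
        double right = Math.max(s.x1, s.x2);
        double overlap = Math.min(right, t.x + t.width) - Math.max(left, t.x);
        boolean covers = overlap >= 0.8 * t.width;
        // 3. It must lie just below the baseline (y grows upwards in PDF)
        double drop = t.baseline - s.y1;
        boolean justBelow = drop > 0 && drop <= 0.25 * t.fontSize;
        return covers && justBelow;
    }

    public static void main(String[] args) {
        TextRun word = new TextRun(100, 500, 60, 12);
        System.out.println(underlines(new Segment(100, 498.5, 160, 498.5), word));
        System.out.println(underlines(new Segment(100, 480, 160, 480), word));
    }
}
```

A line that is much further below the text, like the second one in main, is more likely a table rule than an underline, which is why the drop is bounded relative to the font size.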
Hope this helps!
You can use iText to generate PDF reports.
With iText you can add lines in an easy way.
Try the following.
document.add(new LineSeparator(0.5f, 50, null, 0, 198));
The above code generates a line in the PDF report; set the dimensions according to your needs.
Hope this will help you.
As far as I understand PDFBox, there is no option for reading underlines. Maybe you can try itextpdf for this purpose.
According to the API, getfont() returns the font size.
You can use the getStyle() method, which will return STYLE_UNDERLINE for an underlined font. Thus you can retrieve the underline style.
I have a program. It outputs to PDF, but that is close to impossible to read back in. So I need an additional file attached to my PDF in order to make it editable in my program. Attaching a file to the PDF is a good idea, but an attachment is visible to the user, which I don't want it to be.
An alternative is to hide my readable file format inside an image which would be added to the PDF somewhere at the top of the first page, before everything else... Or even in the metadata, if that's possible...
That way I can extract the image from the PDF using a PDF library (iText), and read from it.
My question is how to add an image to a PDF so that it is as well hidden as it can be (visually and in terms of accessibility). And it has to be in a place which would be the same for any created document (somewhere at the top, or at the very bottom of the document, or in a part of the document which isn't displayed at all... I'm really guessing here, I'm not really familiar with the PDF file format)...
Any ideas?
P.S. It's not really important which image it is; it could be, e.g., a completely transparent image of 1x1 pixels.
I'm not sure what you mean by Image, but you can "extend" the PDF reference.
A PDF consists of objects: PDF numbers, PDF names, PDF strings, PDF arrays, PDF dictionaries, PDF streams. What you probably want, is to add an entry to a dictionary (pick one: the root dictionary, the info dictionary, the root of the page tree,...) that isn't defined in the PDF reference, so that it isn't rendered in a PDF viewer.
The key of such an entry must be a PDF name. To avoid clashes with existing names (names that are part of a current PDF spec, or will be part of a future spec), it is advised to register a four-letter key with ISO. For instance, Adobe registered adbe, and iText registered ITXT and uses that name with an underscore. For instance, ITXT_OriginalData would be a good name if we needed the functionality you describe.
The value of such an entry will be a PDF stream. In iText, you need the PdfStream class for this.
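To show what such a private entry looks like at the file level, here is a small plain-Java sketch that writes out the raw PDF syntax for an uncompressed stream object and for the dictionary entry referring to it. It deliberately avoids any iText calls; in a real program a library class such as PdfStream would take care of this, along with the cross-reference table that this sketch ignores. The class name, the object number 12 and the payload are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class PrivateStream {
    /** Serializes an ASCII payload as an indirect, uncompressed PDF stream object. */
    static String streamObject(int objNum, String asciiPayload) {
        // /Length counts the payload bytes only, not the line end before "endstream"
        int length = asciiPayload.getBytes(StandardCharsets.US_ASCII).length;
        return objNum + " 0 obj\n"
             + "<< /Length " + length + " >>\n"
             + "stream\n"
             + asciiPayload + "\n"
             + "endstream\n"
             + "endobj\n";
    }

    /** Dictionary entry that points to the stream by indirect reference. */
    static String dictEntry(String key, int objNum) {
        return "/" + key + " " + objNum + " 0 R";
    }

    public static void main(String[] args) {
        System.out.print(streamObject(12, "my editable source data"));
        System.out.println(dictEntry("ITXT_OriginalData", 12));
    }
}
```

Because viewers skip dictionary keys they do not know, an entry like this is never rendered, yet a program that knows the key can look it up and read the stream back.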