I need to get the number of lines in a PDF file using Java.
I used itext-2.1.7.jar to get the page count.
Is there any way to get the line count of a PDF?
There's no easy way to do this, only approximations. The problem is that a PDF page is a canvas with drawings at arbitrary locations, and some of those drawings happen to be text rendered in fonts.
One approach is to extract the text together with its location and, from those locations, build a list of what you will consider a line. Use LocationTextExtractionStrategy to get this result, but you'll have to use a more recent jar; iText 2.1.7 is too old and doesn't work that well with text extraction.
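To make the "build a list of what you will consider a line" step concrete, here is a minimal sketch in plain Java. It assumes the text fragments and their baseline coordinates have already been extracted; the TextChunk and LineCounter classes and the tolerance value are hypothetical names for illustration, not part of iText.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for one extracted text fragment and its baseline position.
final class TextChunk {
    final String text;
    final float x;
    final float y;

    TextChunk(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class LineCounter {
    // Counts lines by grouping chunks whose baselines lie close together.
    // A new line starts when a baseline is more than `tolerance` below
    // the first baseline of the line currently being built.
    static int countLines(List<TextChunk> chunks, float tolerance) {
        List<Float> baselines = new ArrayList<Float>();
        for (TextChunk c : chunks) {
            if (c.text.trim().isEmpty()) {
                continue; // whitespace-only fragments do not count as a line
            }
            baselines.add(c.y);
        }
        // PDF y coordinates grow upwards, so sort from top to bottom.
        Collections.sort(baselines, Collections.reverseOrder());

        int lines = 0;
        float current = 0f;
        for (float y : baselines) {
            if (lines == 0 || current - y > tolerance) {
                lines++;
                current = y;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<TextChunk> chunks = Arrays.asList(
                new TextChunk("Hello", 72f, 700f),
                new TextChunk("world", 120f, 700.5f),
                new TextChunk("Second line", 72f, 686f));
        System.out.println(countLines(chunks, 2f));
    }
}
```

The result is still only an approximation: multi-column layouts, rotated text, and superscripts all need extra handling, and the right tolerance depends on the font sizes in the document.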
Related
I want to blur sensitive information in a PDF file. I read about pyPdf for Python and PDFBox for Java, but I could not work out how to search and replace text in a PDF file. By replacing I mean blurring the text or even substituting asterisk characters.
I also thought of taking an image of every page of the PDF and then showing the images in HTML one by one. But then the same problem remains: how do I replace text in those images?
I was going through the iText API docs and was able to create a PDF with a watermark image or text, but I did not find a method to get/extract watermark content from a PDF.
So I have a PDF document containing a watermarked text/image, and I want to extract that text or image and validate it, which I am not able to do.
How to extract watermark content using iText apis? Or is there any other way to validate watermark content?
By validate I mean: if I have an existing PDF/image with some watermarked text [as done in the 2nd link in the references], I want to check whether it has the expected text/image.
References:
http://itextpdf.com/themes/keyword.php?id=226
http://www.java-connect.com/itext/add-watermark-in-PDF-document-using-java-iText-library.html
How to extract watermark content using iText apis? Or is there any other way to validate watermark content?
Extracting watermark content?
There is nothing special about watermarks in PDFs compared to regular page content. They merely
appear early in the content stream, so that content later in the stream is drawn on top of them; or they
appear late in the content stream but have some kind of transparency applied.
Actually there is another type of watermark which is special, the so-called Watermark Annotations. As these annotations can easily be lost when documents are merged or otherwise manipulated, though, they are hardly ever used.
Furthermore, different PDF-generating software suites that offer a way to add watermarks each do so in their own way. Thus, you cannot even recognize watermarks by some special operations done in some specific unique pattern.
Even the iText examples you referred to apply different kinds of watermarks:
MovieCountries2 simply draws some large gray text along an angled baseline.
StampStationery copies a complete page from some PDF (which itself may visually have foreground and background material) into a separate object inside the target PDF and adds a reference to this object at the beginning of every page of the target.
InsertPages similarly references a page from some PDF on every newly generated target document page.
Thus, blind watermark extraction is virtually impossible.
Validating watermark content!
You might try some validation, though, if you know what you are searching for. Instead of searching some fixed watermark stream (which does not exist in PDF), you simply search the whole page content.
iText offers the classes of the parser package which allow extraction of text and/or bitmap images from content streams. Look at the samples referenced from the keywords PARSING PDF > EXTRACTING IMAGES and PARSING PDF > EXTRACTING TEXT.
You merely have to check whether the image or text which you expect can be found by these classes positioned and styled as you expect.
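As a rough illustration of the text part of that check, the sketch below assumes the page text has already been extracted into a String by the parser classes. WatermarkCheck and its methods are hypothetical helper names; the sketch only compares text and ignores position and styling, which you would have to verify separately.

```java
import java.util.Locale;

final class WatermarkCheck {
    // Collapses runs of whitespace and lower-cases the text, because extracted
    // text often contains line breaks or extra spaces inside a watermark.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // Returns true when the expected watermark text occurs in the page text.
    static boolean containsWatermark(String pageText, String expected) {
        return normalize(pageText).contains(normalize(expected));
    }

    public static void main(String[] args) {
        // Hypothetical extracted page text, for illustration only.
        String pageText = "Invoice 42\nCONFIDENTIAL \n  DRAFT\nTotal";
        System.out.println(containsWatermark(pageText, "Confidential draft"));
    }
}
```

A plain substring match like this can also hit regular page content that happens to contain the same words, so for a stricter validation the position and font of the matched text should be compared as well.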
I have a library which generates pdf document with images.
I want to be able to add text after each image. What is the syntax for that? How to insert text into pdf documents?
I have to use the library I have, not another one.
First of all, mkl is correct: have a look at the specification for all of the details. PDF is an exact language; if you make mistakes, you will routinely be punished severely once you open the PDF in viewers.
Secondly, when you think about putting text on the page, don't forget that besides the text operators to draw the text on the page, you'll also have to specify the font to use to draw this text. That includes making sure a font resource is included in the PDF file if your library doesn't automatically handle all of that for you.
If you want to cut corners (I shiver while writing this) and perhaps don't read the specification as thoroughly, try this.
1) Create a PDF file that looks more or less like what you want.
2) Use a tool such as pdfToolbox from callas (http://www.callassoftware.com/callas/doku.php/en:download) or Browser from Enfocus (http://www.enfocus.com/en/products/browser). Both of these tools allow you to investigate the low-level structure of a PDF file, including looking at the actual page description code. This will show you how fonts are embedded (if you have to do it yourself that could be very handy) and how text is rendered on the page (and how you set the font, size, color etc... to use).
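For orientation, the page description code for a simple piece of text usually looks like what the following sketch generates: a BT/ET block that selects a font with Tf, positions the text with Td, and shows it with Tj. This is only a minimal sketch. The class PdfTextOps is a hypothetical helper, the font resource name F1 is an assumption that must match an entry in the page's resources, and the code only covers simple fonts with plain Latin text.

```java
import java.util.Locale;

final class PdfTextOps {
    // Escapes the characters that are special inside a PDF literal string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            if (ch == '\\' || ch == '(' || ch == ')') {
                sb.append('\\');
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    // Builds the content stream fragment that draws `text` at (x, y)
    // using the font resource `fontResource` at the given size.
    static String showText(String fontResource, float size, float x, float y, String text) {
        return String.format(Locale.ROOT,
                "BT\n/%s %.2f Tf\n%.2f %.2f Td\n(%s) Tj\nET\n",
                fontResource, size, x, y, escape(text));
    }

    public static void main(String[] args) {
        // "F1" is an assumed resource name; it must exist in the page's /Font resources.
        System.out.print(showText("F1", 12f, 72f, 700f, "Total (EUR)"));
    }
}
```

How the resulting fragment gets appended to the page content, and how the font resource is registered, depends entirely on the library you are using.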
I have a Word template. It contains the word "photo", which has to be replaced with an image. This has to be done with docx4j.
How do I do this?
If you specifically want to replace text with an image (which is not possible using docx4j, as answered above), you can replace a bookmark with an image as an alternative.
Just open your Word template, position the cursor at the desired location, choose Insert > Bookmark, and name your bookmark.
I followed the instructions here to replace this bookmark with an image.
Disclosure: I manage the docx4j project
The VariableReplace code doesn't handle images.
The best way to do this would be to use data bound content controls, specifically a picture content control pointing via XPath at a base-64 encoded image in an XML document (see Getting Started for details).
However, if you want to replace a word with an image, you can do so, but you'll have to write a bit of glue code. It is pretty straightforward.
First, find the word. You can do this using XPath or TraversalUtil (again, see Getting Started for details).
Hopefully it is in a run (w:r/w:t) by itself. If not, you'll need to split the run up so you don't replace adjacent text.
Then, add the image. See the sample ImageAdd.
I suggest you have a look at the XML created when you add an image in Word (i.e. save and unzip your docx, then look at document.xml). Take care that the XML representing the image is at the correct level (e.g. child of w:p).
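The run-splitting step mentioned above boils down to cutting the run's text into the part before the word, the word itself, and the part after it. Here is a minimal sketch of that logic in plain Java; RunSplitter is a hypothetical helper, and creating the three actual w:r elements from the pieces is left to your docx4j glue code.

```java
import java.util.Arrays;

final class RunSplitter {
    // Splits the text of a run around the first occurrence of `word`.
    // Returns {before, word, after}, or null when the word is absent.
    static String[] split(String runText, String word) {
        int start = runText.indexOf(word);
        if (start < 0) {
            return null;
        }
        String before = runText.substring(0, start);
        String after = runText.substring(start + word.length());
        return new String[] { before, word, after };
    }

    public static void main(String[] args) {
        // The middle piece is the run whose content would be swapped for the image.
        System.out.println(Arrays.toString(split("See photo here", "photo")));
    }
}
```

When the before and after pieces start or end with a space, the w:t elements that hold them need xml:space="preserve", otherwise Word drops that whitespace.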
I'm trying to edit an existing PDF file. It is a file where I need to fill in some addresses and other data. I want to connect an address book to the application so the user can select a contact from the address book and part of the form gets filled in automatically.
My questions are:
Is it possible to edit an existing PDF file and fill in some fields (about 20)? I know there is iText (http://www.itextpdf.com), but I read that its possibilities are very limited.
Or would it be better to convert the PDF to JPG and use it as a background, create JLabels at the places where I need to fill in the fields, and then print the whole frame on A4?
Or are there better possibilities?
So what I need to do, step by step:
Select one of the PDFs (they are in the program)
Fill some fields with content/addresses
Print the PDF/Form with a printer
Adobe provides a toolkit called the Acrobat Forms Data Format (FDF) Toolkit, which offers APIs for different languages to fill in forms.
You can get the java code at the bottom of that page or check this link
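To give an idea of what such form data looks like, here is a minimal sketch that writes an FDF document by hand instead of using the toolkit. It reflects my understanding of the basic FDF layout (a header, one object holding the /Fields array, and a trailer) and should be checked against the PDF specification. FdfWriter and the field name used are hypothetical, and the sketch only handles plain ASCII values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FdfWriter {
    // Escapes the characters that are special inside a PDF literal string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            if (ch == '\\' || ch == '(' || ch == ')') {
                sb.append('\\');
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    // Builds an FDF document with one /T (field name) and /V (value) pair per entry.
    static String build(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("%FDF-1.2\n1 0 obj\n<< /FDF << /Fields [\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<< /T (").append(escape(e.getKey()))
              .append(") /V (").append(escape(e.getValue())).append(") >>\n");
        }
        sb.append("] >> >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("name", "Jane Doe"); // hypothetical field name; must match the form
        System.out.print(build(fields));
    }
}
```

The field names have to match the names of the form fields in the PDF exactly, and the resulting FDF data still has to be imported into the form by a viewer or a library.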
We didn't edit existing PDFs but created entirely new letters/reports/docs from our Java app using iReport.
You could use a PDF form and edit the field values programmatically using iText or Apache PDFBox (download PDFBox and see the SetField.java example).