how to insert text into a scanned pdf document using java


I have to add text to PDF documents, many of which are scanned. In the scanned documents, the inserted text ends up behind the scanned image instead of on top of it. How can I add text over the scanned image inside the PDF?
package editExistingPDF;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;
import jxl.read.biff.BiffException;
import org.apache.commons.io.FilenameUtils;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
public class AddPragraphToPdf {
public static void main(String[] args) throws IOException, DocumentException, BiffException {
String tan = "no tan";
File inputWorkbook = new File("lars.xls");
Workbook w;
w = Workbook.getWorkbook(inputWorkbook);
// Get the first sheet
Sheet sheet = w.getSheet(0);
Cell[] tnas =sheet.getColumn(0);
File ArticleFolder = new File("C:\\Documents and Settings\\sathishkumarkk\\My Documents\\article");
File[] listOfArticles = ArticleFolder.listFiles();
for (int ArticleInList = 0; ArticleInList < listOfArticles.length; ArticleInList++)
{
Document document = new Document(PageSize.A4);
// System.out.println(listOfArticles[ArticleInList].toString());
PdfReader pdfArticle = new PdfReader(listOfArticles[ArticleInList].toString());
if(listOfArticles[ArticleInList].getName().contains(".si."))
{continue;}
int noPgs=pdfArticle.getNumberOfPages();
String ArticleNoWithOutExt = FilenameUtils.removeExtension(listOfArticles[ArticleInList].getName());
String TanNo=ArticleNoWithOutExt.substring(0,ArticleNoWithOutExt.indexOf('.'));
// Create output PDF
PdfWriter writer = PdfWriter.getInstance(document,new FileOutputStream("C:\\Documents and Settings\\sathishkumarkk\\My Documents\\toPrint\\"+ArticleNoWithOutExt+".pdf"));
document.open();
PdfContentByte cb = writer.getDirectContent();
//get tan form excel sheet
System.out.println(TanNo);
for(Cell content : tnas){
if(content.getContents().contains(TanNo)){
tan=content.getContents();
System.out.println(tan);
}else{
continue;
}
}
// Load existing PDF
//PdfReader reader = new PdfReader(new FileInputStream("1.pdf"));
for (int i = 1; i <= noPgs; i++) {
PdfImportedPage page = writer.getImportedPage(pdfArticle, i);
// Copy first page of existing PDF into output PDF
document.newPage();
cb.addTemplate(page, 0, 0);
// Add your TAN here
Paragraph p= new Paragraph(tan);
Font font = new Font();
font.setSize(1.0f);
p.setLeading(12.0f, 1.0f);
p.setFont(font);
document.add(p);
}
document.close();
}
}
}
NOTE: When a PDF contains only text, I have no problem. But when a PDF consists entirely of scanned pages and I try to add text, the text is added behind the scanned image, so it does not appear when I print the PDF.

From this iText example (which does the reverse of what you want, but replace getUnderContent with getOverContent and you'll be fine):
Each PDF page has two extra layers: one that sits on top of all text and graphics, and one that goes to the bottom. All user-added content goes in between these two. If we get access to this bottommost layer, we can write anything we want under the existing content. To get access to this bottommost layer, we can use the "getUnderContent" method of the PdfStamper object.
This is documented in iText API Reference as shown below:
public PdfContentByte getUnderContent(int pageNum)
Gets a PdfContentByte to write under the page of the original document.
Parameters:
pageNum - the page number where the extra content is written
Returns:
a PdfContentByte to write under the page of the original document
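Applied to the question, the idea could be sketched roughly as follows. This is a minimal sketch, assuming iText 5 (the com.itextpdf packages) is on the classpath; the class name StampOverScan, the method stampOver, and the position (36, 36) are illustrative choices and not part of the quoted example. It opens the existing PDF with PdfStamper and writes the text onto the layer returned by getOverContent, so the text is drawn on top of the scanned image rather than behind it.

```java
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper class; the name is not from the original post.
final class StampOverScan {

    // Writes `text` near the lower-left corner of every page, on top of existing content.
    static void stampOver(InputStream src, OutputStream dest, String text)
            throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfStamper stamper = new PdfStamper(reader, dest);
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            // getOverContent returns the layer drawn after (above) the page's own content
            PdfContentByte over = stamper.getOverContent(i);
            // (36, 36) is an assumed position; it is half an inch from the left and
            // bottom edges only if the page origin is at (0, 0)
            ColumnText.showTextAligned(over, Element.ALIGN_LEFT, new Phrase(text), 36, 36, 0);
        }
        stamper.close();
        reader.close();
    }
}
```

Unlike the PdfWriter and getImportedPage approach in the question, this keeps the original pages as they are and only adds a layer of content to each of them.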

To do this, you first need to read in the PDF document, extract its pages, add the text, and then save the result as a PDF document. This of course assumes that you can read the PDF document in the first place.
I'd recommend iText (see its example code) to help you do this.

Related

Split a PDF page in two parts [duplicate]

This question already has an answer here: itext Split PDF Vertically (1 answer). Closed 6 years ago.
I would like to take a single-page PDF and then split it into two parts (cutting the page in the middle), regardless of the text on that page. I'm using iText, but I can't find any examples of how to do this.

You cannot really split a page; that would be quite a difficult task. What you can do is clone the content of the page into a new page with half the original size, then repeat for the second page, applying a translation to the content.
Here is an example with PDFBox; I have been using it lately and had a sandbox project ready for the test. You can surely do the same with iText.
package printit;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
public class CutIt {
    public static void main(String[] args) throws IOException {
        PDDocument outdoc = new PDDocument();
        PDDocument doc = PDDocument.load(new File("sample_1.pdf"));
        PDPage page = (PDPage) doc.getDocumentCatalog().getPages().get(0);
        PDRectangle cropBox = page.getCropBox();
        float upperRightY = cropBox.getUpperRightY();
        float lowerLeftY = cropBox.getLowerLeftY();
        // Vertical midpoint of the crop box (also correct when lowerLeftY is not 0)
        float middleY = (lowerLeftY + upperRightY) / 2;
        // First output page: the top half of the original page
        cropBox.setLowerLeftY(middleY);
        page.setCropBox(cropBox);
        outdoc.importPage(page);
        // Second output page: the bottom half of the original page
        cropBox = page.getCropBox();
        cropBox.setUpperRightY(middleY);
        cropBox.setLowerLeftY(lowerLeftY);
        page.setCropBox(cropBox);
        outdoc.importPage(page);
        outdoc.save("cut.pdf");
        outdoc.close();
        doc.close();
    }
}
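The box arithmetic behind this approach is independent of the PDF library. The following is a minimal sketch in plain Java; the class HalfPageBoxes and its methods are hypothetical names introduced for illustration. Each box is a float array {lowerLeftX, lowerLeftY, upperRightX, upperRightY}, and the page is cut at the vertical midpoint of its crop box.

```java
// Hypothetical helper, not part of PDFBox or iText.
final class HalfPageBoxes {

    // Returns the upper half of a box given as {llx, lly, urx, ury}.
    static float[] topHalf(float[] box) {
        float middleY = (box[1] + box[3]) / 2;
        return new float[] { box[0], middleY, box[2], box[3] };
    }

    // Returns the lower half of a box given as {llx, lly, urx, ury}.
    static float[] bottomHalf(float[] box) {
        float middleY = (box[1] + box[3]) / 2;
        return new float[] { box[0], box[1], box[2], middleY };
    }
}
```

For a US Letter page with the box {0, 0, 612, 792}, topHalf gives {0, 396, 612, 792} and bottomHalf gives {0, 0, 612, 396}. Taking the midpoint as (lly + ury) / 2 keeps the cut centered even when the box does not start at y = 0.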

Merge pdf documents of different width using iText

I am having a problem merging documents of different widths using iText.
Below is the code I'm using to merge.
public static void doMerge(List<InputStream> list, OutputStream outputStream) throws Exception {
Rectangle pagesize = new Rectangle(1700f, 20f);
com.itextpdf.text.Document document = new com.itextpdf.text.Document(pagesize);
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
document.setPageSize(pagesize);
com.itextpdf.text.pdf.PdfContentByte cb = writer.getDirectContent();
for (InputStream in : list){
PdfReader reader = new PdfReader(in);
for (int i = 1; i <= reader.getNumberOfPages(); i++){
document.newPage();
//import the page from source pdf
com.itextpdf.text.pdf.PdfImportedPage page = writer.getImportedPage(reader, i);
//calculate the y for merging it from top
float y = document.getPageSize().getHeight() - page.getHeight();
//add the page to the destination pdf
cb.addTemplate(page, 0, y);
}
reader.close();
in.close();
}
outputStream.flush();
document.close();
outputStream.close();
}
The first page of the PDF will be 14 inches wide and 13 inches high. All other pages in the document will always be smaller than that.
I want to merge all the documents into a single document.
I don't know how to set the width and height of the merged document.
I thought Rectangle pagesize = new Rectangle(1700f, 20f); would do it, but it doesn't work: if I change it to Rectangle pagesize = new Rectangle(1700f, 200f);, nothing changes in the document.
Please guide me further.

Using the PdfWriter class to merge documents goes against all the recommendations given in the official documentation, though there are unofficial examples that may have lured you into writing bad code. I hope that you understand that I find these bad examples even more annoying than you do.
Please take a look at Table 6.1 in chapter 6 of my book. It gives you an overview showing when to use which class. In this case, you should use PdfCopy:
String[] files = { MovieLinks1.RESULT, MovieHistory.RESULT };
// step 1
Document document = new Document();
// step 2
PdfCopy copy = new PdfCopy(document, new FileOutputStream(RESULT));
// step 3
document.open();
// step 4
PdfReader reader;
int n;
// loop over the documents you want to concatenate
for (int i = 0; i < files.length; i++) {
reader = new PdfReader(files[i]);
// loop over the pages in that document
n = reader.getNumberOfPages();
for (int page = 0; page < n; ) {
copy.addPage(copy.getImportedPage(reader, ++page));
}
copy.freeReader(reader);
reader.close();
}
// step 5
document.close();
If you are using a recent version of iText, you can even use the addDocument() method in which case you don't need to loop over all the pages. You also need to take special care if forms are involved. There are several examples demonstrating the new functionality (dating from after the book was written) in the Sandbox.

With iText version 5.5, we can merge PDFs more easily using the PdfCopy.addDocument method:
package tn.com.sf.za.rd.controller;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
public class ReportMerging {
public static void main(String[] args) throws DocumentException, IOException {
String DOC_ONE_PATH = "D:/s.zaghdoudi/tmp/one.pdf";
String DOC_TWO_PATH = "D:/s.zaghdoudi/tmp/two.pdf";
String DOC_THREE_PATH = "D:/s.zaghdoudi/tmp/three.pdf";
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileOutputStream(DOC_THREE_PATH));
document.open();
PdfReader readerOne = new PdfReader(DOC_ONE_PATH);
PdfReader readerTwo = new PdfReader(DOC_TWO_PATH);
copy.addDocument(readerOne);
copy.addDocument(readerTwo);
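document.close();
// Release the source files once the merged document has been written
readerOne.close();
readerTwo.close();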
}
}

extract image from image

Is it possible to extract an image from a jpeg, png or tiff file? NOT PDF! Suppose I have a file containing both text and images in jpeg format (so it's basically a picture); I want to be able to extract the image only programmatically (preferably using Java). If anyone knows useful libraries please let me know. I have already tried AspriseOCR and tesseract-ocr, they have been successful at extracting text only (obviously).
Thank you.

Try:
// Corners of the region to extract, in pixels
int startPointX = xxx;
int startPointY = xxx;
int endPointX = xxx;
int endPointY = xxx;
BufferedImage image = ImageIO.read(new File("D:/temp/test.jpg"));
// getSubimage takes (x, y, width, height), not two corner points
BufferedImage out = image.getSubimage(startPointX, startPointY, endPointX - startPointX, endPointY - startPointY);
ImageIO.write(out, "jpg", new File("D:/temp/result.jpg"));
These points define the region of the image you want to extract.
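As a self-contained sketch of the same idea, the helper below crops a region given by two corner points. The class RegionCropper and the method crop are hypothetical names; only BufferedImage.getSubimage comes from the standard library. Note that getSubimage expects the top-left corner plus a width and a height.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration.
final class RegionCropper {

    // Crops the rectangle spanned by (startX, startY) and (endX, endY), end exclusive.
    static BufferedImage crop(BufferedImage image, int startX, int startY, int endX, int endY) {
        int width = endX - startX;
        int height = endY - startY;
        // getSubimage shares pixel data with the source image; it does not copy it
        return image.getSubimage(startX, startY, width, height);
    }
}
```

For a 100 x 80 image, crop(image, 10, 20, 60, 50) returns a 50 x 30 image whose pixel (0, 0) is the source pixel (10, 20). ImageIO.read and ImageIO.write can be used around it to load the source file and save the result.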
Extract image from pdf file
I suggest changing your post title. You can use the PDFBox or iText API. The example below extracts all of the images from a PDF file. It uses the PDFBox 1.x API (getAllPages(), PDXObjectImage), which was removed in PDFBox 2.0.
This may be a useful resource for you. If the PDF contains a lot of images, a java.lang.OutOfMemoryError may occur.
Download pdfbox.xx.jar here.
import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
public class ExtractImagesFromPDF {
public static void main(String[] args) throws Exception {
PDDocument document = PDDocument.load(new File("D:/temp/test.pdf"));
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while(iter.hasNext()) {
PDPage page = (PDPage)iter.next();
PDResources resources = page.getResources();
Map images = resources.getImages();
if( images != null ) {
Iterator imageIter = images.keySet().iterator();
while(imageIter.hasNext()) {
String key = (String)imageIter.next();
System.out.println("Key : " + key);
PDXObjectImage image = (PDXObjectImage)images.get(key);
File file = new File("D:/temp/" + key + "." + image.getSuffix());
image.write2file(file);
}
}
}
}
}
Extract specific image from pdf file
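To extract a specific image, you have to know the index of the page and the key of the image in that page's resources; otherwise, you cannot extract it.
The following example extracts the first image of the first page. It assumes the image is registered under the key "Im0"; the actual key names depend on the tool that produced the PDF.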
int targetPage = 0;
PDPage firstPage = (PDPage)document.getDocumentCatalog().getAllPages().get(targetPage);
PDResources resources = firstPage.getResources();
Map images = resources.getImages();
int targetImage = 0;
String imageKey = "Im" + targetImage;
PDXObjectImage image = (PDXObjectImage)images.get(imageKey);
File file = new File("D:/temp/" + imageKey + "." + image.getSuffix());
image.write2file(file);

If you are interested in an out-of-box product that could do this via black-box processing with minimal non-programming configuration (since you tried other products), then ABBYY FlexiCapture can do it. It can be configured to look for dynamic sizes of pictures/objects in loosely defined areas, or anywhere on the page, with full control over search logic. I used it once to extract lines of specific shape and thickness to separate chapters of a book, where each line indicated a new chapter, and could be anywhere on the page.

Itext PDF shows blank page

I have the below sample code which generates a PDF file using iText.
My question is about the base64Binary string created by the DatatypeConverter.printBase64Binary method.
I copied the System.out.println output of "base64Binary",
used an online base64 decoder tool to decode the content, and saved the output as sample.pdf.
When I try to open sample.pdf, it shows empty pages. I am not sure why it behaves this way; any help would be much appreciated.
But when I decode it directly in Java and write it to a file on disk, it shows the content.
Can someone help me understand why it shows blank when I save the decoded "base64Binary" output as sample.pdf?
Thanks.
Below is the code snippet:
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import javax.xml.bind.DatatypeConverter;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Font;
import com.itextpdf.text.Font.FontFamily;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfWriter;
/**
* Creates a PDF file in memory.
*/
public class HelloWorldMemory {
/** Path to the resulting PDF file. */
public static final String RESULT = "C:////hello_memory.pdf";
public static void main(final String[] args) throws DocumentException, IOException {
// step 1
final Document document = new Document();
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
final PdfWriter writer = PdfWriter.getInstance(document, baos);
document.open();
final PdfContentByte cb = writer.getDirectContent();
cb.beginText();
cb.setFontAndSize(getBaseFont(Font.NORMAL), 24);
final float exPosition = (PageSize.A4.getWidth()) / 2;
cb.showTextAligned(Element.ALIGN_CENTER, "Test No", exPosition, 670, 0);
cb.endText();
document.add(new Paragraph("Hello World!"));
document.close();
System.out.println("baos.toByteArray():" + baos.toByteArray());
final String base64Binary = DatatypeConverter.printBase64Binary(baos.toByteArray());
System.out.println("base64Binary:" + base64Binary);
final byte[] txt = DatatypeConverter.parseBase64Binary(base64Binary);
final FileOutputStream fos = new FileOutputStream(RESULT);
fos.write(txt);
fos.close();
}
private static BaseFont getBaseFont(final int fontType) {
final Font f = new Font(FontFamily.HELVETICA, 0, fontType);
final BaseFont baseFont = f.getCalculatedBaseFont(true);
return baseFont;
}
}

This question is not really related to iText or PDF. You'll have the same problem with any binary data that is base64 encoded. When using the online base64 decoder, your binary data gets corrupted somehow. Bruno already explained in his answer why this does not completely invalidate a file in case of a PDF.
The data is probably corrupted because of encoding issues. Maybe the online base64 decoder displayed the decoded data in a textarea or something and you copy/pasted it into a file? If you use a decoder that offers you a binary file for download, the result should be fine.
I tested with http://www.opinionatedgeek.com/dotnet/tools/base64decode/ (the first hit of a Google search). When I save the .bin file and rename it to .pdf, it displays as expected in a PDF viewer.

PDF is a binary file format based on the Carousel Object System (COS) syntax and the AIM (Adobe Imaging Model). The COS objects use ASCII for the file structures, but the streams for images and AIM syntax are usually binary. When you copy a PDF file without respecting the binary aspect of the file, a PDF viewer can render the document structure (the pages) based on the ASCII COS objects, but not the content (what's on the pages). This is probably what happens in your case: you're corrupting the bytes in the content streams.
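The effect can be reproduced without iText. The following sketch uses java.util.Base64 (Java 8 and later) instead of DatatypeConverter; the class Base64RoundTrip and the choice of US-ASCII as the lossy text encoding are assumptions made for illustration, since the encoding used by the online tool is not known. Decoding straight to bytes preserves the data, while passing the decoded bytes through a text encoding replaces every byte that has no character in that encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class.
final class Base64RoundTrip {

    // Binary-safe path: Base64 text -> bytes, with no text decoding in between.
    static byte[] decodeBinary(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    // Lossy path: mimics copying decoded output out of a text field.
    static byte[] decodeViaText(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        String asText = new String(raw, StandardCharsets.US_ASCII); // bytes >= 0x80 become U+FFFD
        return asText.getBytes(StandardCharsets.US_ASCII);          // U+FFFD is written as '?'
    }
}
```

With input bytes such as %PDF followed by 0xE2 0xE3 0xCF 0xD3, decodeBinary returns the original bytes, while decodeViaText keeps the ASCII part and replaces the four high bytes with '?'. This matches the behavior described above: the ASCII structure survives, the binary streams do not.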

How to resize a PdfPTable to fit the page?

I am generating a document as in the following code, except of course that the contents of the table vary. I need to make sure that this table never exceeds one page in size, regardless of the amount of content in the cells. Is there a way to do it?
import com.itextpdf.text.Phrase;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.pdf.PdfPCell;
import com.itextpdf.text.pdf.PdfPTable;
import java.awt.Desktop;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
public void createTemplate() throws DocumentException, FileNotFoundException, IOException{
String TARGET = System.getProperty("user.home")+"\\temp.pdf";
Document document = new Document(PageSize.A4);
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(TARGET));
document.open();
PdfPTable table = new PdfPTable(7);
for (int i = 0; i < 105; i++) {
Phrase p = new Phrase("some text");
PdfPCell cell = new PdfPCell();
cell.addElement(p);
table.addCell(cell);
}
table.setTotalWidth(PageSize.A4.getWidth()-10);
table.setLockedWidth(true);
PdfContentByte canvas = writer.getDirectContent();
PdfTemplate template = canvas.createTemplate(table.getTotalWidth(),table.getTotalHeight());
table.writeSelectedRows(0, -1, 0, PageSize.A4.getHeight(), template);
Image img = Image.getInstance(template);
img.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
img.setAbsolutePosition(0, PageSize.A4.getHeight());
document.add(img);
document.close();
Desktop desktop = Desktop.getDesktop();
File file = new File(TARGET);
desktop.open(file); }
Edit: @Bruno Lowagie: The hint with the template wrapped in an image sounds just right to me, and I changed the code accordingly, but all I get now is an empty PDF. Am I doing something wrong, or is this the wrong approach altogether?

If you want a table to fit a page, you should create the table before even thinking about page size and ask the table for its height as is done in the TableHeight example. Note that the getTotalHeight() method returns 0 unless you define the width of the table. This can be done like this:
table.setTotalWidth(width);
table.setLockedWidth(true);
Now you can create a Document with size Rectangle(0, 0, width + margin * 2, getTotalHeight() + margin * 2) and the table should fit the document exactly when you add it with the writeSelectedRows() method.
If you don't want a custom page size, then you need to create a PdfTemplate with the size of the table and add the table to this template object. Then wrap the template object in an Image and use scaleToFit() to size the table down.
public static void main(String[] args) throws DocumentException, FileNotFoundException, IOException {
String TARGET = "temp.pdf";
Document document = new Document(PageSize.A4);
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(TARGET));
document.open();
PdfPTable table = new PdfPTable(7);
for (int i = 0; i < 700; i++) {
Phrase p = new Phrase("some text");
PdfPCell cell = new PdfPCell();
cell.addElement(p);
table.addCell(cell);
}
table.setTotalWidth(PageSize.A4.getWidth());
table.setLockedWidth(true);
PdfContentByte canvas = writer.getDirectContent();
PdfTemplate template = canvas.createTemplate(table.getTotalWidth(), table.getTotalHeight());
table.writeSelectedRows(0, -1, 0, table.getTotalHeight(), template);
Image img = Image.getInstance(template);
img.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
img.setAbsolutePosition(0, (PageSize.A4.getHeight() - img.getScaledHeight()) / 2);
document.add(img);
document.close();
}
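The scaling and centering arithmetic can be checked in isolation. The sketch below is plain Java with hypothetical names (FitToPage, scaleFactor, centeredY). It models fitting as a single uniform factor, min(pageWidth / width, pageHeight / height); this is an assumption about the intended behavior, not a copy of iText's implementation.

```java
// Hypothetical helper; models uniform scale-to-fit, not iText's actual code.
final class FitToPage {

    // Largest uniform factor such that (width, height) fits inside (pageWidth, pageHeight).
    static float scaleFactor(float width, float height, float pageWidth, float pageHeight) {
        return Math.min(pageWidth / width, pageHeight / height);
    }

    // Lower-left y that centers content of the given scaled height on the page.
    static float centeredY(float scaledHeight, float pageHeight) {
        return (pageHeight - scaledHeight) / 2;
    }
}
```

For a table 595 points wide and 1684 points high on a 595 x 842 page, the factor is 0.5, the scaled height is 842, and centeredY returns 0. Centering with the unscaled height of 1684 would give a negative y, which moves part of the content off the page.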
