What is the correct way to deep clone PDPage? - java

I am working with PDFBOX v2, I'm trying to clone the first PDPage of a PDDocument for keep it as template for new PDPages. That first page, has some acroform fields that I need fill.
I tried some methods but anyone makes I want to achieve.
1) Copy the first page content and add it to the document when I need a new page. That copy the page but the acroform field are linked with other pages fields, and if I modify field value from the first page, that shows in the other pages.
//Save in variable first page content
COSDictionary pageContent = (COSDictionary)doc.getPage(0).getCOSObject();
...
//when i need insert new page
doc.addPage(new PDPage(pageContent));
2) Clone the first page content and then add to the document like the first method. That copy the page but no field is copied :/
PDFCloneUtility cloner = new PDFCloneUtility(doc);
COSDictionary pageContent = (COSDictionary)cloner.cloneForNewDocument(doc.getPage(0).getCOSObject());
...
//when i need insert new page
doc.addPage(new PDPage(pageContent));
Then, what is the correct way to make a deep copy of a PDPage getting acroform fields independent from the first page?
Thanks!

I got the solution!
1) Start with an empty pdf template, only has 1 page. Open template document, fill common data and save as byte[] in memory.
PDDocument templatedoc = PDDocument.load(new File(path));
PDDocumentCatalog catalog = templatedoc.getDocumentCatalog();
PDAcroFrom acroForm = catalog.getAcroForm());
... fill acroForm common data of all pages ...
ByteArrayOutputStream basicTemplate = new ByteArrayOutputStream();
templatedoc.save(basicTemplate);
byte[] filledBasicTemplate = basicTemplate.toByteArray();
2) Generate new document for each needed page.
List<PDDocument> documents = new ArrayList<PDDocument>();
PDDocument activeDoc;
for(int i = 0; i < 5; i++) {
activeDoc = PDDocument.load(filledBasicTemplate);
documents.add(activeDoc);
... fill acroform or you need in each page ...
}
3) Import all new document first pages into final document and save final document.
PDDocument finalDoc = new PDDocument();
for(PDDocument currentDoc : documents) {
... fill common things like page numbers ...
finalDoc.importPage(currentDoc.getPage(0));
}
finalDoc.save(new File(path));
... close all documents after save the final document ...
It maybe not be the most optimized code, but it works.

Related

How do you add multiple images to a PDF with itext7 Java?

First google result takes me to Add multiple images into a single pdf file with iText using java which was posted 5 years ago. I am not sure which version they are using, because the Image object doesn't even have the getInstance method for me. Needless to say I am not getting much help from that link.
Anyways I am trying to create a javaFX application that loops multiple JPG images to create a single PDF document. Below is my code, which successfully creates a PDF from 2 images, but I am having trouble making the second image display on the second page.
In the link I posted above the simple solution I saw was to do document.newPage() then do document.add(img), but my document object doesn't have that method? I am not sure what to do.
PdfWriter writer = new PdfWriter("D:/sample1.pdf");
// Creating a PdfDocument
PdfDocument pdfDoc = new PdfDocument(writer);
// Adding a new page
// I can add multiple pages here, but when I add multiple images they do not
// automatically flow over to the next page.
pdfDoc.addNewPage();
pdfDoc.addNewPage();
// Creating a Document
Document document = new Document(pdfDoc);
String imageFile = "C:/Users/***/Downloads/MAT204/1.3-1.4 HW/test.jpg";
ImageData data = ImageDataFactory.create(imageFile);
Image img = new Image(data);
img.setAutoScale(true);
img.setRotationAngle(-Math.toRadians(90));
// I can add multiple images, but they overlaps each other and only
// appears on the first page.
// Is there a way for me to change the current page to write on?
document.add(img);
document.add(img);
// Closing the document
document.close();
System.out.println("PDF Created");
Anyways, I just want to figure out how to manually add another image before I write a loop to automate the process.
After doing more research I found the answer here.
https://kb.itextpdf.com/home/it7kb/examples/multiple-images
protected void manipulatePdf(String dest) throws Exception {
Image image = new Image(ImageDataFactory.create(IMAGES[0]));
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(dest));
Document doc = new Document(pdfDoc, new PageSize(image.getImageWidth(), image.getImageHeight()));
for (int i = 0; i < IMAGES.length; i++) {
image = new Image(ImageDataFactory.create(IMAGES[i]));
pdfDoc.addNewPage(new PageSize(image.getImageWidth(), image.getImageHeight()));
image.setFixedPosition(i + 1, 0, 0);
doc.add(image);
}
doc.close();
}

Adding Top level Bookmark to the exsting PDF using PDFBOX

I would like to add a top level bookmark to an existing pdf file using PDFBOX in JAVA.
Not sure why the following code was not working, can anyone help me out? Thanks.
Below is how the Document.pdf looks like in the bookmark section.
Top
---Node-1
-------Node-11
-------Node-12
....
---Node-2
-------Node-21
....
Java code (Part within the program) :
PDDocument document = PDDocument.load(new File("C:/Users/Desktop/document.pdf"))
PDDocumentOutline documentOutline = new PDDocumentOutline();
document.getDocumentCatalog().setDocumentOutline(documentOutline);
PDOutlineItem pagesOutline = new PDOutlineItem();
pagesOutline.setTitle("All Pages");
documentOutline.addFirst(pagesOutline);
pagesOutline.openNode();
documentOutline.openNode();
document.getDocumentCatalog().setPageMode(PageMode.USE_OUTLINES);
document.save("C:/Users/Desktop/document.pdf");
document.close()
Here's my attempt at this, I'm keeping my filename if asked questions later.
What I did is to wrap the old outline into a new item. It is not possible to add the existing items one by one, because only "orphans" can be added.
PDDocument document = PDDocument.load(new File("000009.pdf"));
PDDocumentOutline oldDocumentOutline = document.getDocumentCatalog().getDocumentOutline();
PDDocumentOutline documentOutline = new PDDocumentOutline();
document.getDocumentCatalog().setDocumentOutline(documentOutline);
PDOutlineItem pagesOutline = new PDOutlineItem();
//pagesOutline.setTitle("All Pages");
//documentOutline.addFirst(pagesOutline);
PDOutlineItem oldOutlineItemWrapped = new PDOutlineItem(oldDocumentOutline.getCOSObject());
oldOutlineItemWrapped.setTitle("All Pages");
documentOutline.addFirst(oldOutlineItemWrapped);
//pagesOutline.openNode();
oldOutlineItemWrapped.openNode();
documentOutline.openNode();
document.getDocumentCatalog().setPageMode(PageMode.USE_OUTLINES);
document.save("000009-modified.pdf");
document.close();

iText: Importing styled Text and informations from an existing PDF

I´m generating PDFs using iText and it works fine. But I need a way to import html styled informations from an existing PDF at some point.
I know i could just use the XMLWorker class to generate the text directly from html in my own document. But cause I´m not sure whether it actually supports all html features I´m looking to work around this.
Therefore a PDF is generated from html using XSLT. The content of this PDF then should be copied to my document.
There are two ways discribed in the book ("iText in Action").
One that parses the PDF and gets you the text (or other informations) from the document using PdfReaderContentParser and TextExtractionStrategy.
It looks like this:
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy;
for(int i=1;i<=reader.getNumberOfPages();i++){
strategy = parser.processContent(i, new LocationTextExtractionStrategy());
document.add(new Chunk(strategy.getResultantText()));
}
But this only prints plain text to the document. Obviously there are more ExtractionStrategys and maybe one of them does exactly what i want but i couldn´t find it yet.
The second way is to copy an itextpdf.text.Image of each side of the PDF to your document. This is obviously not a good idea, cause it will add the entire page to your document even if there is only one line of text in the existing PDF. Its done like this:
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
PdfReader reader = new PdfReader(pdf);
PdfImportedPage page;
for(int i=1;i<=reader.getNumberOfPages();i++){
page = writer.getImportedPage(reader,i);
document.add(Image.getInstance(page));
}
Like I said this copys all the empty lines at the end of the PDF aswell, but i need to continue my text immediatly after the last line of text.
If I could convert this itext.text.Image into a java.awt.BufferedImage I could use getSubImage(); and informations i can extract from the PDF to cut away all the empty lines. But i wasn´t able to find a way to to this.
This are the two ways i found. But cause none of them is suitable for my purpose as they are my question is:
Is there a way to import everything except the empty lines at the end, but including text-style informations, tables and everything else from a PDF to my document using iText?
You can trim away empty space of the XSLT generated PDF and then import the trimmed pages as in your code.
Example code
The following code borrows from the code in my answer to Using iTextPDF to trim a page's whitespace. In contrast to the code there, though, we have to manipulate the media box, not the crop box, because this is the only box respected by PdfWriter.getImportedPage.
Before importing a page from a given PdfReader, crop it using this method:
static void cropPdf(PdfReader reader) throws IOException
{
int n = reader.getNumberOfPages();
for (int i = 1; i <= n; i++)
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MarginFinder finder = parser.processContent(i, new MarginFinder());
Rectangle rect = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
PdfDictionary page = reader.getPageN(i);
page.put(PdfName.MEDIABOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
}
}
(excerpt from ImportPageWithoutFreeSpace.java)
The extended render listener MarginFinder is taken as is from the question linked to above. You can find a copy here: MarginFinder.java.
Example run
Using this code
PdfReader readerText = new PdfReader(docText);
cropPdf(readerText);
PdfReader readerGraphics = new PdfReader(docGraphics);
cropPdf(readerGraphics);
try ( FileOutputStream fos = new FileOutputStream(new File(RESULT_FOLDER, "importPages.pdf")))
{
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, fos);
document.open();
document.add(new Paragraph("Let's import 'textOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
document.add(Image.getInstance(writer.getImportedPage(readerText, 1)));
document.add(new Paragraph("and now 'graphicsOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
document.add(Image.getInstance(writer.getImportedPage(readerGraphics, 1)));
document.add(new Paragraph("That's all, folks!", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
document.close();
}
finally
{
readerText.close();
readerGraphics.close();
}
(excerpt from unit test method testImportPages in ImportPageWithoutFreeSpace.java)
I imported both the page from the docText document
and the page from the docGraphics document
into a new document with some text before, between, and after. The result:
As you can see, source styles are preserved but free space around is discarded.

Is it possible to mark a string in pdf document?

I was wondering if it was possible to mark strings in pdf with different color or underline them while looping through the pdf document ?
It's possible on creating a document. Just use different chunks to set the style. Here's an example:
Document document = new Document();
PdfWriter.getInstance(document, outputStream);
document.open();
document.add(new Chunk("This word is "));
Chunk underlined = new Chunk("underlined");
underlined.setUnderline(1.0f, -1.0f); //We can customize thickness and position of underline
document.add(underlined);
document.add(new Chunk(". And this phrase has "));
Chunk background = new Chunk("yellow background.");
background.setBackground(BaseColor.YELLOW);
document.add(background);
document.add(Chunk.NEWLINE);
document.close();
However, it's almost impossible to edit an existing PDF document. The author of iText writes in his book:
In a PDF document, every character or glyph on a PDF page has its
fixed position, regardless of the application that’s used to view the
document. This is an advantage, but it also comes with a disadvantage.
Suppose you want to replace the word “edit” with the word “manipulate”
in a sentence, you’d have to reflow the text. You’d have to reposition
all the characters that follow that word. Maybe you’d even have to
move a portion of the text to the next page. That’s not trivial, if
not impossible.
If you want to “edit” a PDF, it’s advised that you change the original
source of the document and remake the PDF.
Aspose.PDF APIs support to create new PDF document and manipulate existing PDF documents without Adobe Acrobat dependency. You can search and add Highlight Annotation to mark PDF text.
REST API Solution using Aspose.PDF Cloud SDK for Java:
// For complete examples and data files, please go to https://github.com/aspose-pdf-cloud/aspose-pdf-cloud-java
String name = "02_pages.pdf";
String folder="Temp";
String remotePath=folder+"/"+name;
// File to upload
File file = new File("C:/Temp/"+name);
// Storage name is default storage
String storage = null;
// Get App Key and App SID from https://dashboard.aspose.cloud/
PdfApi pdfApi = new PdfApi("xxxxxxxxxxxxxxxxxxxx", "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxx");
//Upload file to cloud storage
pdfApi.uploadFile(remotePath,file,storage);
//Text Position
Rectangle rect = new Rectangle().LLX(259.27580539703365).LLY(743.4707997894287).URX(332.26148873138425).URY(765.5148007965088);
List<AnnotationFlags> flags = new ArrayList<>();
flags.add(AnnotationFlags.DEFAULT);
HighlightAnnotation annotation = new HighlightAnnotation();
annotation.setName("Name Updated");
annotation.rect(rect);
annotation.setFlags(flags);
annotation.setHorizontalAlignment(HorizontalAlignment.CENTER);
annotation.setRichText("Rich Text Updated");
annotation.setSubject("Subj Updated");
annotation.setPageIndex(1);
annotation.setZindex(1);
annotation.setTitle("Title Updated");
annotation.setModified("02/02/2018 00:00:00.000 AM");
List<HighlightAnnotation> annotations = new ArrayList<>();
annotations.add(annotation);
//Add Highlight Annotation to the PDF document
AsposeResponse response = pdfApi.postPageHighlightAnnotations(name,1, annotations, storage, folder);
//Download annotated PDF file from Cloud Storage
File downloadResponse = pdfApi.downloadFile(remotePath, null, null);
File dest = new File("C:/Temp/HighlightAnnotation.pdf");
Files.copy(downloadResponse.toPath(), dest.toPath(), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
System.out.println("Completed......");
On-Premise Solution using Aspose.PDF for Java:
// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.Pdf-for-Java
// Instantiate Document object
Document document = new Document("C:/Temp/Test.pdf");
// Create TextFragment Absorber instance to search particular text fragment
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Estoque");
// Iterate through pages of PDF document
for (int i = 1; i <= document.getPages().size(); i++) {
// Get first page of PDF document
Page page = document.getPages().get_Item(i);
page.accept(textFragmentAbsorber);
}
// Create a collection of Absorbed text
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// Iterate on above collection
for (int j = 1; j <= textFragmentCollection.size(); j++) {
TextFragment textFragment = textFragmentCollection.get_Item(j);
// Get rectangular dimensions of TextFragment object
Rectangle rect = new Rectangle((float) textFragment.getPosition().getXIndent(), (float) textFragment.getPosition().getYIndent(), (float) textFragment.getPosition().getXIndent() + (float) textFragment.getRectangle().getWidth(), (float) textFragment.getPosition().getYIndent() + (float) textFragment.getRectangle().getHeight());
// Instantiate HighLight Annotation instance
HighlightAnnotation highLight = new HighlightAnnotation(textFragment.getPage(), rect);
// Set opacity for annotation
highLight.setOpacity(.80);
// Set the color of annotation
highLight.setColor(Color.getYellow());
// Add annotation to annotations collection of TextFragment
textFragment.getPage().getAnnotations().add(highLight);
}
// Save updated document
document.save("C:/Temp/HighLight.pdf");
P.S: I work as support/evangelist developer at Aspose.

Reading a particular page from a PDF document using PDFBox

How do I read a particular page (given a page number) from a PDF document using PDFBox?
This should work:
PDPage firstPage = (PDPage)doc.getAllPages().get( 0 );
as seen in the BookMark section of the tutorial
Update 2015, Version 2.0.0 SNAPSHOT
Seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:
PDDocument document = PDDocument.load(new File(filename));
PDPage doc = document.getPage(0);
The getAllPages method has been renamed getPages
PDPage page = (PDPage)doc.getPages().get( 0 );
//Using PDFBox library available from http://pdfbox.apache.org/
//Writes pdf document of specific pages as a new pdf file
//Reads in pdf document
PDDocument pdDoc = PDDocument.load(file);
//Creates a new pdf document
PDDocument document = null;
//Adds specific page "i" where "i" is the page number and then saves the new pdf document
try {
document = new PDDocument();
document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
document.save("file path"+"new document title"+".pdf");
document.close();
}catch(Exception e){}
Thought I would add my answer here as I found the above answers useful but not exactly what I needed.
In my scenario I wanted to scan each page individually, look for a keyword, if that keyword appeared, then do something with that page (ie copy or ignore it).
I've tried to simply and replace common variables etc in my answer:
public void extractImages() throws Exception {
try {
String destinationDir = "OUTPUT DIR GOES HERE";
// Load the pdf
String inputPdf = "INPUT PDF DIR GOES HERE";
document = PDDocument.load( inputPdf);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
// Declare output fileName
String fileName = "output.pdf";
// Create output file
PDDocument newDocument = new PDDocument();
// Create PDFTextStripper - used for searching the page string
PDFTextStripper textStripper=new PDFTextStripper();
// Declare "pages" and "found" variable
String pages= null;
boolean found = false;
// Loop through each page and search for "SEARCH STRING". If this doesn't exist
// ie is the image page, then copy into the new output.pdf.
for(int i = 0; i < list.size(); i++) {
// Set textStripper to search one page at a time
textStripper.setStartPage(i);
textStripper.setEndPage(i);
PDPage returnPage = null;
// Fetch page text and insert into "pages" string
pages = textStripper.getText(document);
found = pages.contains("SEARCH STRING");
if (i != 0) {
// if nothing is found, then copy the page across to new output pdf file
if (found == false) {
returnPage = list.get(i - 1);
System.out.println("page returned is: " + returnPage);
System.out.println("Copy page");
newDocument.importPage(returnPage);
}
}
}
newDocument.save(destinationDir + fileName);
System.out.println(fileName + " saved");
}
catch (Exception e) {
e.printStackTrace();
System.out.println("catch extract image");
}
}
you can you getPage method over PDDocument instance
PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);
Here is the solution. Hope it will solve your issue.
string fileName="C:\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
//above page number 1 to 2 will be parsed. for parsing only one page set both value same (ex:setStartPage(1); setEndPage(1);)
string reslut = stripper.getText(doc);
doc.close();
Add this to the command-line call:
ExtractText -startPage 1 -endPage 1 filename.pdf
Change 1 to the page number that you need.

Categories