I downloaded a PDF file from a website, and every page contains a hyperlink to that site inside a rectangle. I want to remove the link from every page.
I am using PDFBox version 2.0.8.
I figured out that the link is described in the Annots entry of every page of the document. I deleted the Annots content corresponding to the link. Of course, I set the needToBeUpdated flag to true on every node in the chain from the PDF catalog.
In debug mode I see that the readOnly flag is set to true in the AccessPermission object.
When I open the edited PDF file, all pages are empty, and for every page Acrobat Reader shows the following error:
There was an error processing a page. Invalid Function resource.
I have several questions:
Can I programmatically change the PDF file when the readOnly flag is set to true in the AccessPermission object?
Why do I get the error described above?
What do I need to do to remove the unnecessary link from each page so that every page displays properly in the PDF document?
Here is my code (sorry for the quality, this is only a draft):
File book = new File(path_to_pdf_file);
PDDocument document = PDDocument.load(book);
document.setAllSecurityToBeRemoved(true);
COSDictionary dictionary = document.getDocumentCatalog().getCOSObject();
dictionary.removeItem(COSName.PERMS);
dictionary.setNeedToBeUpdated(true);
((COSObject) document.getDocumentCatalog().getCOSObject().getItem(COSName.PAGES)).setNeedToBeUpdated(true);
dictionary = document.getDocumentCatalog().getPages().getCOSObject();
dictionary.setNeedToBeUpdated(true);
COSArray arr = (COSArray) dictionary.getDictionaryObject(COSName.KIDS);
arr.setNeedToBeUpdated(true);
COSArray arrayForLoop;
COSDictionary tempDic;
for (int k = 0; k < arr.size(); ++k) {
    COSObject object = (COSObject) arr.get(k);
    object.setNeedToBeUpdated(true);
    dictionary = (COSDictionary) object.getObject();
    dictionary.setNeedToBeUpdated(true);
    arrayForLoop = (COSArray) dictionary.getItem(COSName.ANNOTS);
    arrayForLoop.setNeedToBeUpdated(true);
    arrayForLoop = (COSArray) arrayForLoop.getCOSObject();
    arrayForLoop.setNeedToBeUpdated(true);
    dictionary = (COSDictionary) arrayForLoop.get(0);
    dictionary.setNeedToBeUpdated(true);
    dictionary.removeItem(COSName.TYPE);
    dictionary.removeItem(COSName.SUBTYPE);
    dictionary.removeItem(COSName.RECT);
    dictionary.removeItem(COSName.BORDER);
    tempDic = (COSDictionary) dictionary.getItem(COSName.A);
    tempDic.setNeedToBeUpdated(true);
    dictionary.removeItem(COSName.A);
}
document.saveIncremental(new FileOutputStream(path_to_save_file));
document.close();
In the code above I iterate over every page and delete the Annots content that corresponds to the link. I also used the saveIncremental method to traverse all modified nodes from leaf to root.
Thank you for your answers.
I have a pdf template which contains images and form fields.
I read this template and fill form fields per page and write to a temp pdf file. Then I read this file and copy to a master document to have multiple pages using the same template. Roughly as below:
Document masterDoc = ...
-- loop per page --
PdfWriter pfdWriter = new PdfWriter(tmpFileName);
PdfDocument pdf = new PdfDocument(new PdfReader(templateFile), pfdWriter);
Document doc = new Document(pdf);
// Set form fields
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
form.setDefaultJustification(0);
Map<String, PdfFormField> formFields = form.getFormFields();
formFields.get("key").setValue("value");
form.flattenFields();
doc.close();
try (PdfDocument resource = new PdfDocument(new PdfReader("pathToTmpFile"))) {
resource.copyPagesTo(1, 1, masterDoc.getPdfDocument());
}
-- end of loop
This approach is slow (it depends on the template file size, but it takes seconds, not milliseconds).
Would it be possible to reuse the same template file for every page without writing to and reading from temp files?
I read the documentation and guess it might be possible with a new-page event handler, but I couldn't figure it out.
I am trying to use PDFBox to create a link I can click to go to another page in the same document.
From this question (How to use PDFBox to create a link that goes to *previous view*?) I see that this should be easy to do, but when I try it I get this error: Exception in thread "main" java.lang.IllegalArgumentException: Destination of a GoTo action must be a page dictionary object
I am using this code:
//Loading an existing document consisting of 3 empty pages.
File file = new File("C:\\Users\\Student\\Documents\\MyPDF\\Test_doc.pdf");
PDDocument document = PDDocument.load(file);
PDPage page = document.getPage(1);
PDAnnotationLink link = new PDAnnotationLink();
PDPageDestination destination = new PDPageFitWidthDestination();
PDActionGoTo action = new PDActionGoTo();
destination.setPageNumber(2);
action.setDestination(destination);
link.setAction(action);
link.setPage(page);
I am using PDFBox 2.0.13. Can anyone give me some guidance on what I'm doing wrong?
Appreciate all answers.
First of all, for a local link ("a link i can click to go to another page in the same document"), destination.setPageNumber is the wrong method to use, cf. its JavaDocs:
/**
 * Set the page number for a remote destination. For an internal destination, call
 * {@link #setPage(PDPage) setPage(PDPage page)}.
 *
 * @param pageNumber The page for a remote destination.
 */
public void setPageNumber( int pageNumber )
Thus, replace
destination.setPageNumber(2);
by
destination.setPage(document.getPage(2));
Furthermore, you forgot to set a rectangle area for the link and you forgot to add the link to the page annotations.
All together:
PDPage page = document.getPage(1);
PDAnnotationLink link = new PDAnnotationLink();
PDPageDestination destination = new PDPageFitWidthDestination();
PDActionGoTo action = new PDActionGoTo();
destination.setPage(document.getPage(2));
action.setDestination(destination);
link.setAction(action);
link.setPage(page);
link.setRectangle(page.getMediaBox());
page.getAnnotations().add(link);
(AddLink test testAddLinkToMwb_I_201711)
We have a vendor that will not accept PDFs that contain links. We are trying to remove the links by removing all link annotations from each page of the PDF using iText 7.1 (Java). We have tried multiple techniques based on research. Here are three examples of attempts to detect and remove the links. None of these result in the destination PDF (test-no-links.pdf) having the links removed. Any insight would be greatly appreciated.
Example 1: Remove based on class type of annotation
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
    PdfPage pdfPage = pdfDoc.getPage(page);
    List<PdfAnnotation> annots = pdfPage.getAnnotations();
    if ((annots == null) || (annots.size() == 0)) {
        System.out.println("no annotations on page " + page);
    }
    else {
        for( PdfAnnotation annot : annots ) {
            if( annot instanceof PdfLinkAnnotation ) {
                pdfPage.removeAnnotation(annot);
            }
        }
    }
}
pdfDoc.close();
Example 2: Remove based on annotation subtype value
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
    PdfPage pdfPage = pdfDoc.getPage(page);
    List<PdfAnnotation> annots = pdfPage.getAnnotations();
    if ((annots == null) || (annots.size() == 0)) {
        System.out.println("no annotations on page " + page);
    }
    else {
        for( PdfAnnotation annot : annots ) {
            // if this annotation has a link, delete it
            if ( annot.getSubtype().equals(PdfName.Link) ) {
                PdfDictionary annotAction = ((PdfLinkAnnotation)annot).getAction();
                if( annotAction.get(PdfName.S).equals(PdfName.URI) ||
                    annotAction.get(PdfName.S).equals(PdfName.GoToR) ) {
                    PdfString uri = annotAction.getAsString(PdfName.URI);
                    System.out.println("Removing " + uri.toString());
                    pdfPage.removeAnnotation(annot);
                }
            }
        }
    }
}
pdfDoc.close();
Example 3: Remove all annotations (ignore annotation type)
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
    PdfPage pdfPage = pdfDoc.getPage(page);
    // remove all annotations from the page regardless of type
    pdfPage.getPdfObject().remove(PdfName.Annots);
}
pdfDoc.close();
Each of your tests generates a PDF without Link annotations.
Probably, though, your PDF viewer recognizes "www.qualpay.com" as (partial) URL and displays it as a link.
In detail
Your routines
All your tests successfully remove all Link annotations from your sample PDF, cf. these screenshots for the source and all three result files; in particular, look at the page 1 Annots entry:
test-with-links.pdf
test-no-links.pdf
test-no-links-1.pdf
test-no-links-2.pdf
The viewer
Indeed, though, when viewing the PDF in Adobe Acrobat Reader (and also some other viewers, e.g. the built-in PDF viewers of Chrome and Edge), you'll see that "www.qualpay.com" is treated like a link.
The cause is that this is a feature of the PDF viewer! It scans the text of the PDF it displays for strings it recognizes as (a part of) some URL and displays them like links!
In Adobe Acrobat Reader you can disable this feature:
If you disable "Create links from URLs", you'll suddenly find the URLs in your result files inactive while the URL in your source file (with the link annotation) is still active.
What to do
We have a vendor that will not accept PDFs that contain links.
First discuss with your vendor what exactly they mean by "PDFs that contain links". Do they mean
PDFs with Link annotations, or
PDFs with URLs that common PDF viewers present like Link annotations?
In the former case you're done: your code (any of the three variants) removes the Link annotations. You may have to demonstrate to the vendor how to disable the URL recognition in Adobe Acrobat Reader, though.
In the latter case you'll have to remove everything from the text content of your PDFs that common PDF viewers recognize as URLs. You may replace each URL by a bitmap image of the URL text, or the URL text drawn like a generic vector graphic (defining a path of lines and curves and filling that), or some similar surrogate.
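To find out which text would need replacing in the latter case, you could scan the extracted page text for URL-like strings. The following is only a rough sketch using the JDK: the class name UrlLikeTextFinder and the regular expression are assumptions made for illustration, and the pattern is a simple heuristic, not the rule any particular PDF viewer actually applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlLikeTextFinder {
    // Assumed heuristic: "http://", "https://" or "www." followed by a dotted host name
    // and an optional path. Real viewers may recognize more (or fewer) strings.
    private static final Pattern URL_LIKE = Pattern.compile(
            "(?i)\\b(?:https?://|www\\.)[a-z0-9-]+(?:\\.[a-z0-9-]+)+(?:/[^\\s]*)?");

    /** Returns every substring of the text that matches the heuristic, in order. */
    public static List<String> findUrlLike(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = URL_LIKE.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [www.qualpay.com]
        System.out.println(findUrlLike("Visit www.qualpay.com for details."));
    }
}
```

Any hit is a candidate for the surrogate treatment described above; how the page text is extracted in the first place is left out of this sketch.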
I'm generating a PDDocument in Java with code like this...
HashMap<Integer, PDPageContentStream> mPageContentStreamMap = new HashMap<>();
PDDocument doc = new PDDocument();
for (int i = 1; i <= mNumPages; i++) {
    PDPage page = new PDPage(PDRectangle.A4);
    page.setRotation(90);
    PDPageContentStream pageContentStream = new PDPageContentStream(doc, page);
    mPageContentStreamMap.put(i, pageContentStream);
    doc.addPage(page);
}
Then later save and close the document like this...
for (int i : mPageContentStreamMap.keySet()) {
    mPageContentStreamMap.get(i).close();
}
doc.save("test-filename");
doc.close();
This works fine on the first run; however when I run my program multiple times I get the following error
java.io.IOException: Scratch file already closed
at org.apache.pdfbox.io.ScratchFile.checkClosed(ScratchFile.java:390)
at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:78)
at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:403)
at org.apache.pdfbox.cos.COSStream.createOutputStream(COSStream.java:208)
at org.apache.pdfbox.pdmodel.common.PDStream.createOutputStream(PDStream.java:224)
at org.apache.pdfbox.pdmodel.PDPageContentStream.<init>(PDPageContentStream.java:259)
at org.apache.pdfbox.pdmodel.PDPageContentStream.<init>(PDPageContentStream.java:121)
If I re-run my program without the "doc.close();" line, this error goes away, but the output of the PDF is duplicated (i.e. a new PDF is generated, but with the content from the last PDF and the content from the current PDF).
Is there a way to close the stream and create multiple PDFs without running into the scratch file error?
I had created a singleton object for my drawing logic, which meant that after the first run the same objects were reused when they shouldn't have been, because the input (what was being drawn) had changed.
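The general shape of the fix can be sketched without PDFBox: build the closable state inside each run instead of keeping it in a long-lived object. In the sketch below, FakeDocument is a hypothetical stand-in for a document (it is not a PDFBox class), and render is an assumed name for the drawing entry point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FreshStatePerRun {
    /** Hypothetical stand-in for a closable document; not a PDFBox class. */
    static class FakeDocument implements AutoCloseable {
        private final List<String> content = new ArrayList<>();
        private boolean closed;

        void draw(String item) {
            // Mirrors the failure mode: using the object after close() is an error.
            if (closed) {
                throw new IllegalStateException("document already closed");
            }
            content.add(item);
        }

        List<String> content() {
            return content;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Each call creates and closes its own document, so no state survives between runs. */
    static List<String> render(List<String> items) {
        try (FakeDocument doc = new FakeDocument()) {
            for (String item : items) {
                doc.draw(item);
            }
            // Copy the result before the document is closed.
            return new ArrayList<>(doc.content());
        }
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList("a", "b"))); // prints [a, b]
        System.out.println(render(Arrays.asList("c")));      // prints [c]
    }
}
```

With this layout, the second run neither sees the first run's content nor touches an already closed object, which corresponds to the two symptoms described in the question.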
I have two PDF files (named A1.pdf and B1.pdf). Now I want to replace some pages of the second PDF file (B1.pdf) with pages from the first one (A1.pdf) programmatically. I am using the PDFBox library for this.
Here is my sample code:
try {
    File file = new File("/Users/test/Desktop/A1.pdf");
    PDDocument pdDoc = PDDocument.load(file);
    PDDocument document = PDDocument.load(new File("/Users/test/Desktop/B1.pdf"));
    document.removePage(3);
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0));
    document.save("/Users/test/Desktop/"+"generatedPDFBox"+".pdf");
    document.close();
} catch (Exception e) {}
The idea is to replace the 3rd page. In this implementation the page is instead appended after the last page of the output PDF. Can anyone help me implement this? If it is not possible with PDFBox, could you please suggest some other Java libraries?
This solution creates a third PDF file with the contents you asked for. Note that page indices are zero-based, so the "3" in your question must be a "2".
PDDocument a1doc = PDDocument.load(file1);
PDDocument b1doc = PDDocument.load(file2);
PDDocument resDoc = new PDDocument();
List<PDPage> a1Pages = a1doc.getDocumentCatalog().getAllPages();
List<PDPage> b1Pages = b1doc.getDocumentCatalog().getAllPages();
// replace the 3rd page of the 2nd file with the 1st page of the 1st one
for (int p = 0; p < b1Pages.size(); ++p)
{
    if (p == 2)
        resDoc.addPage(a1Pages.get(0));
    else
        resDoc.addPage(b1Pages.get(p));
}
resDoc.save(file3);
a1doc.close();
b1doc.close();
resDoc.close();
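The index logic of the loop above can be checked in isolation on plain lists. This is only a sketch with an assumed helper name (replaceAt) and strings standing in for pages; it does not touch PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PageReplaceSketch {
    /** Returns a copy of base in which the element at the zero-based index is replaced. */
    static <T> List<T> replaceAt(List<T> base, int index, T replacement) {
        if (index < 0 || index >= base.size()) {
            throw new IndexOutOfBoundsException("no page at index " + index);
        }
        List<T> result = new ArrayList<>(base.size());
        for (int p = 0; p < base.size(); ++p) {
            // Same decision as in the loop above: swap one position, copy the rest.
            result.add(p == index ? replacement : base.get(p));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> b1 = Arrays.asList("B1-p1", "B1-p2", "B1-p3", "B1-p4");
        // Index 2 addresses the third page because indices start at 0.
        // prints [B1-p1, B1-p2, A1-p1, B1-p4]
        System.out.println(replaceAt(b1, 2, "A1-p1"));
    }
}
```

Building a new list rather than removing and then appending is what keeps the replacement at the intended position instead of at the end.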
If you want to work from the command line instead, look here:
https://pdfbox.apache.org/commandline/
Then use PDFSplit and PDFMerge.
I am not too familiar with how PDFBox works, but to answer your follow-up: I know you can accomplish what you want fairly simply with the Datalogics APDFL SDK. A free trial exists in case you want to look into it. Here is a code snippet so you can see how it would be done:
Document Doc1 = new Document("/Users/test/Desktop/A1.pdf");
Document Doc2 = new Document("/Users/test/Desktop/B1.pdf");
/* Delete pages on the page range 3-3*/
Doc2.deletePages(3, 3);
/* LastPage is where in Doc2 you want to insert the page, Doc1 the document from which the page is coming from, 0 is the page number in Doc1 that will be inserted first, 1 is the number of pages that will be inserted (beginning from the page number specified in the previous parameter), and PageInsertFlags which would let you customize what gets / doesn't get copied */
Doc2.insertPages(Document.LastPage, Doc1, 0, 1, PageInsertFlags.All);
Doc2.save(EnumSet.of(SaveFlags.FULL), "out.pdf");
Alternatively, there is another method called replacePages which makes the deletion unnecessary. It all depends on what your end goal is, of course.