We have a vendor that will not accept PDFs that contain links. We are trying to remove the links by removing all link annotations from each page of the PDF using iText 7.1 (Java). We have tried multiple techniques based on research. Here are three examples of attempts to detect and remove the links. None of these result in the destination PDF (test-no-links.pdf) having the links removed. Any insight would be greatly appreciated.
Example 1: Remove based on class type of annotation
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
List<PdfAnnotation> annots = pdfPage.getAnnotations();
if ((annots == null) || (annots.size() == 0)) {
System.out.println("no annotations on page " + page);
}
else {
for( PdfAnnotation annot : annots ) {
if( annot instanceof PdfLinkAnnotation ) {
pdfPage.removeAnnotation(annot);
}
}
}
}
pdfDoc.close();
Example 2: Remove based on annotation subtype value
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
List<PdfAnnotation> annots = pdfPage.getAnnotations();
if ((annots == null) || (annots.size() == 0)) {
System.out.println("no annotations on page " + page);
}
else {
for( PdfAnnotation annot : annots ) {
// if this annotation has a link, delete it
if ( annot.getSubtype().equals(PdfName.Link) ) {
PdfDictionary annotAction = ((PdfLinkAnnotation)annot).getAction();
// a link may carry a destination instead of an action, so guard against null
if( annotAction != null &&
( PdfName.URI.equals(annotAction.get(PdfName.S)) ||
PdfName.GoToR.equals(annotAction.get(PdfName.S)) ) ) {
// GoToR actions have no URI entry, so uri may be null here
PdfString uri = annotAction.getAsString(PdfName.URI);
System.out.println("Removing " + uri);
pdfPage.removeAnnotation(annot);
}
}
}
}
}
pdfDoc.close();
Example 3: Remove all annotations (ignore annotation type)
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
// remove all annotations from the page regardless of type
pdfPage.getPdfObject().remove(PdfName.Annots);
}
pdfDoc.close();
Each of your tests generates a PDF without Link annotations.
Probably, though, your PDF viewer recognizes "www.qualpay.com" as (partial) URL and displays it as a link.
In detail
Your routines
All your tests successfully remove all Link annotations from your sample PDF, cf. these screen shots for the source and all three result files, in particular look for the page 1 Annots entry:
test-with-links.pdf
test-no-links.pdf
test-no-links-1.pdf
test-no-links-2.pdf
The viewer
Indeed, though, when viewing the PDF in Adobe Acrobat Reader (and also some other viewers, e.g. the built-in PDF viewers of Chrome and Edge), you'll see that "www.qualpay.com" is treated like a link.
The cause is that this is a feature of the PDF viewer! It scans the text of the PDF it displays for strings it recognizes as (a part of) some URL and displays them like links!
In Adobe Acrobat Reader you can disable this feature in the preferences:
If you disable "Create links from URLs", you'll suddenly find the URLs in your result files inactive while the URL in your source file (with the link annotation) is still active.
What to do
"We have a vendor that will not accept PDFs that contain links."
First discuss with your vendor what exactly he means by "PDFs that contain links". Does he mean
PDFs with Link annotations or
PDFs with URLs that common PDF viewers present like Link annotations.
In the former case you're done; your code (any of the three variants) removes the link annotations. You may have to demonstrate to the vendor how to disable the URL recognition in Adobe Acrobat Reader, though.
In the latter case you'll have to remove everything from the text content of your PDFs that common PDF viewers recognize as URLs. You may replace each URL by a bitmap image of the URL text, or the URL text drawn like a generic vector graphic (defining a path of lines and curves and filling that), or some similar surrogate.
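For the latter case, a rough first step is locating the strings a viewer might auto-link. The exact recognition rules of Acrobat Reader, Chrome or Edge are not known here, so the pattern below (anything starting with http://, https:// or www.) is only an assumption, and the class and method names are made up for illustration. The input would be page text from whatever text extraction you already use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlLikeTextScanner {
    // Assumption: viewers auto-link text starting with a scheme or "www.".
    // Real viewers may recognize more (or fewer) patterns than this.
    private static final Pattern URL_LIKE =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)[^\\s<>\"]+");

    /** Returns every URL-like substring found in the given page text. */
    static List<String> findUrlLikeStrings(String pageText) {
        List<String> hits = new ArrayList<>();
        Matcher m = URL_LIKE.matcher(pageText);
        while (m.find()) {
            // Drop trailing sentence punctuation that is not part of the URL.
            hits.add(m.group().replaceAll("[.,;:!?)]+$", ""));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Questions? Visit www.qualpay.com or call us.";
        System.out.println(findUrlLikeStrings(text)); // prints [www.qualpay.com]
    }
}
```

Each hit is a candidate for replacement by an image or vector surrogate as described above.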
Related
I am using version 2.0 of the PDFBox library. I need to open the PDF in a new browser tab, i.e. in print view.
As we are migrating from iText to PDFBox, below is the existing code with iText.
In that code, the PdfAction class is used to achieve this:
PdfAction action = new PdfAction(PdfAction.PRINTDIALOG);
and to apply the print JavaScript to the document:
copy.addJavaScript(action);
I need an equivalent solution with PDFBox.
Document document = new Document();
try{
outputStream=response.getOutputStream();
// step 2
PdfCopy copy = new PdfCopy(document, outputStream);
// step 3
document.open();
// step 4
PdfReader reader;
int n;
//add print dialog in Pdf Action to open file for preview.
PdfAction action = new PdfAction(PdfAction.PRINTDIALOG);
// loop over the documents you want to concatenate
Iterator i=mergepdfFileList.iterator();
while(i.hasNext()){
File f =new File((String)i.next());
is=new FileInputStream(f);
reader=new PdfReader(is);
n = reader.getNumberOfPages();
for (int page = 0; page < n; ) {
copy.addPage(copy.getImportedPage(reader, ++page));
}
copy.freeReader(reader);
reader.close();
is.close();
}
copy.addJavaScript(action);
// step 5
document.close();
}catch(IOException io){
throw io;
}catch(DocumentException e){
throw e;
}catch(Exception e){
throw e;
}finally{
outputStream.close();
}
I also tried the approach from the reference below, but could not find a print() method on the PDDocument type.
Reference Link
Please guide me with this.
This is how the file looks when displayed in a browser tab:
This code reproduces what your file has, a JavaScript action in the name tree in the JavaScript entry in the name dictionary in the document catalog. ("When the document is opened, all of the actions in this name tree shall be executed, defining JavaScript functions for use by other scripts in the document" - PDF specification) There's probably an easier way to do this, e.g. with an OpenAction.
PDActionJavaScript javascript = new PDActionJavaScript("this.print(true);\n");
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDDocumentNameDictionary names = new PDDocumentNameDictionary(documentCatalog, new COSDictionary());
PDJavascriptNameTreeNode javascriptNameTreeNode = new PDJavascriptNameTreeNode();
Map<String, PDActionJavaScript> map = new HashMap<>();
map.put("0000000000000000", javascript);
javascriptNameTreeNode.setNames(map);
names.setJavascript(javascriptNameTreeNode);
document.getDocumentCatalog().setNames(names);
I am working on PDF-to-Excel conversion using Docparser.
But Docparser is unable to process scanned PDFs properly, so I need to separate the scanned PDFs from normal PDFs and process only the normal ones through Docparser (i.e. via the API call).
Is there some way to identify the PDF type (scanned or normal) programmatically so that I can proceed?
Please help if anyone knows how to tackle this problem.
Finally, I found a solution to my question, though I don't think it is a standard one. Thanks to the people who commented and provided help.
Using the PDFBox library, we can go through the pages of the PDF and check whether each XObject in a page's resources is an image (PDImageXObject); every one that is gets counted. If the number of images equals the number of pages in the PDF, we consider it a scanned PDF.
Here is the code:
public static String testPdf(String filename) throws IOException
{
    int imageCount = 0;
    int pageCount;
    // try-with-resources closes the document even if an exception is thrown
    try (PDDocument doc = PDDocument.load(new File(filename)))
    {
        pageCount = doc.getNumberOfPages();
        for (PDPage page : doc.getPages())
        {
            PDResources resources = page.getResources();
            if (resources == null)
            {
                continue; // a page without resources cannot reference images
            }
            for (COSName xObjectName : resources.getXObjectNames())
            {
                // count every image XObject listed in the page resources
                if (resources.getXObject(xObjectName) instanceof PDImageXObject)
                {
                    imageCount++;
                }
            }
        }
    }
    // heuristic: as many images as pages means the PDF is treated as scanned
    if (imageCount == pageCount)
    {
        return "Scanned pdf";
    }
    else
    {
        return "Searchable pdf";
    }
}
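Because the rule above compares the total number of images with the number of pages, a document can pass even when the images are unevenly spread. The plain-Java sketch below illustrates this with hypothetical per-page image counts (no PDFBox involved). The stricter per-page variant is my own assumption, not part of the answer, and both method names are made up for illustration.

```java
class ScannedPdfHeuristic {
    // Rule from the answer above: total image count equals page count.
    static boolean looksScannedByTotal(int[] imagesPerPage) {
        int total = 0;
        for (int n : imagesPerPage) {
            total += n;
        }
        return total == imagesPerPage.length;
    }

    // Stricter variant (an assumption): exactly one image on every page.
    static boolean looksScannedPerPage(int[] imagesPerPage) {
        for (int n : imagesPerPage) {
            if (n != 1) {
                return false;
            }
        }
        return imagesPerPage.length > 0;
    }

    public static void main(String[] args) {
        int[] counts = {2, 0}; // hypothetical: two images on page 1, none on page 2
        System.out.println(looksScannedByTotal(counts)); // prints true
        System.out.println(looksScannedPerPage(counts)); // prints false
    }
}
```

Neither variant looks at text content, so a searchable PDF with one logo per page would still be reported as scanned.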
In most cases I have 2 PDF files: one main PDF containing around 30000 pages, and another containing 1 page, which I want to insert into the main one after every X pages (depending on my separate index file), while also adding barcodes to each page.
The problem I have is that the resulting PDF becomes very large (10 GB+), while the input files are 350 MB and the small one I want to insert is under 50 KB.
What's a good way to optimize the size of the PDF I'm creating?
Here are the relevant parts of the code handling the PDF merge.
PdfImportedPage page;
PdfSmartCopy outPdf;
PdfSmartCopy.PageStamp stamp;
PdfReader pdfReader, pdfInsertReader;
...
outDoc = new Document();
outPdf = new PdfSmartCopy(outDoc, new FileOutputStream(outFile));
pdfToolset = new PDFToolset();
outDoc.open();
...
//loop over pages in my index-file
for (IndexPage indexpage : item.pages) {
if (indexpage.insertPage){
currentDoc = indexpage.source_file;
outPdf.freeReader(pdfReader);
outPdf.flush();
if (!currentDoc.equals(insertDoc)) {
insertDoc = currentDoc;
pdfInsertReader = new PdfReader(currentDoc);
}
} else if (!currentDoc.equals(indexpage.source_file)) {
currentDoc = indexpage.source_file;
outPdf.freeReader(pdfInsertReader);
outPdf.flush();
if (!mainDoc.equals(currentDoc)){
mainDoc = currentDoc;
pdfReader = new PdfReader(mainDoc);
}
}
if (indexpage.insertPage)
page = outPdf.getImportedPage(pdfInsertReader, indexpage.source_page);
else
page = outPdf.getImportedPage(pdfReader, indexpage.source_page);
if (!duplex || (duplex && indexpage.nr % 2 == 1)) {
stamp = outPdf.createPageStamp(page);
stamp = pdfToolset.applyBarcode(stamp, indexpage.omr, indexpage.nr);
stamp.alterContents();
}
outPdf.addPage(page);
}
...
I'm starting on new functionality in my Android application that will help fill in certain PDF forms.
I found that the best solution would be to use the iText library.
I can read the file and read the AcroFields from the document, but is there any way to find out whether a specific field is marked as required?
I tried to find this in the API documentation and on the Internet, but found nothing that helps solve this issue.
Please take a look at section 13.3.4 of my book, entitled "AcroForms revisited". Listing 13.15 shows a code snippet from the InspectForm example that checks whether or not a field is a password or a multi-line field.
With some minor changes, you can adapt that example to check for required fields:
for (Map.Entry<String,AcroFields.Item> entry : fields.entrySet()) {
out.write(entry.getKey());
item = entry.getValue();
dict = item.getMerged(0);
flags = dict.getAsNumber(PdfName.FF);
if (flags != null && (flags.intValue() & BaseField.REQUIRED) > 0)
out.write(" -> required\n");
}
To check whether a single field is required without looping (iText 5):
String src = "yourpath/pdf1.pdf";
String dest = "yourpath/pdf2.pdf";
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
AcroFields form = stamper.getAcroFields();
Map<String,AcroFields.Item> fields = form.getFields();
// look up the field by name; the map returns null if no such field exists
AcroFields.Item item = fields.get("fieldName");
if (item != null)
{
    PdfDictionary dict = item.getMerged(0);
    PdfNumber flags = dict.getAsNumber(PdfName.FF);
    // /Ff holds the field flags as a bit mask; test the Required bit
    if (flags != null && (flags.intValue() & BaseField.REQUIRED) > 0)
    {
        System.out.println("required flag is set");
    }
}
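Both snippets above rely on the same bit test against the /Ff entry. As a minimal sketch of just that test, the plain-Java class below restates the flag value so it runs without iText. In the PDF specification the Required flag is bit position 2, i.e. the value 2, and the class and method names here are hypothetical.

```java
class FieldFlags {
    // Required flag of /Ff: bit position 2 in the PDF specification, value 1 << 1.
    // Restated here (instead of using BaseField.REQUIRED) to stay independent of iText.
    static final int REQUIRED = 1 << 1;

    /** Returns true if the flag value is present and has the Required bit set. */
    static boolean isRequired(Integer ff) {
        return ff != null && (ff & REQUIRED) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isRequired(2));    // prints true  (Required only)
        System.out.println(isRequired(3));    // prints true  (ReadOnly + Required)
        System.out.println(isRequired(1));    // prints false (ReadOnly only)
        System.out.println(isRequired(null)); // prints false (no /Ff entry)
    }
}
```

The null case matters because a field without an /Ff entry has no flags set at all, which is why the iText snippets check flags != null first.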
I have been trying to split one big PDF file into multiple PDF files based on size. I was able to split it, but it creates only a single file and the rest of the data is lost; that is, it does not create more than one output file. Can anyone please help? Here is my code:
public static void main(String[] args) {
try {
PdfReader Split_PDF_By_Size = new PdfReader("C:\\Temp_Workspace\\TestZip\\input1.pdf");
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileOutputStream("C:\\Temp_Workspace\\TestZip\\File1.pdf"));
document.open();
int number_of_pages = Split_PDF_By_Size.getNumberOfPages();
int pagenumber = 1; /* To generate file name dynamically */
// int Find_PDF_Size; /* To get PDF size in bytes */
float combinedsize = 0; /* To convert this to Kilobytes and estimate new PDF size */
for (int i = 1; i < number_of_pages; i++ ) {
float Find_PDF_Size;
if (combinedsize == 0 && i != 1) {
document = new Document();
pagenumber++;
String FileName = "File" + pagenumber + ".pdf";
copy = new PdfCopy(document, new FileOutputStream(FileName));
document.open();
}
copy.addPage(copy.getImportedPage(Split_PDF_By_Size, i));
Find_PDF_Size = copy.getCurrentDocumentSize();
combinedsize = (float)Find_PDF_Size / 1024;
if (combinedsize > 496 || i == number_of_pages) {
document.close();
combinedsize = 0;
}
}
System.out.println("PDF Split By Size Completed. Number of Documents Created:" + pagenumber);
}
catch (Exception i)
{
System.out.println(i);
}
}
}
(BTW, it would have been great if you had tagged your question with itext, too.)
PdfCopy used to close the PdfReaders it imported pages from whenever the source PdfReader for page imports switched or the PdfCopy was closed. This was due to the original intended use case to create one target PDF from multiple source PDFs in combination with the fact that many users forget to close their PdfReaders.
Thus, after you close the first target PdfCopy, the PdfReader is closed, too, and no further pages are extracted.
If I interpret the most recent checkins into the iText SVN repository correctly, this implicit closing of PdfReaders is in the process of being removed from the code. Therefore, with one of the next iText versions, your code may work as intended.
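Until then, one possible workaround (an assumption on my part, not something confirmed above) is to decide the page ranges first and then write each range with its own PdfCopy fed by a freshly opened PdfReader, so an implicitly closed reader is never reused. The range planning itself can be sketched in plain Java. The per-page sizes are hypothetical estimates, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPlanner {
    /**
     * Returns inclusive 1-based page ranges {first, last}. A range is closed
     * once its running size exceeds limitBytes or the last page is reached.
     */
    static List<int[]> planRanges(long[] pageSizes, long limitBytes) {
        List<int[]> ranges = new ArrayList<>();
        int first = 1;
        long running = 0;
        // "<=" so that the last page is visited and the final range is closed
        for (int page = 1; page <= pageSizes.length; page++) {
            running += pageSizes[page - 1];
            if (running > limitBytes || page == pageSizes.length) {
                ranges.add(new int[] {first, page});
                first = page + 1;
                running = 0;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        long[] sizes = {300, 300, 300, 100}; // hypothetical page sizes in KB
        for (int[] r : planRanges(sizes, 500)) {
            System.out.println(r[0] + "-" + r[1]); // prints 1-2, then 3-4
        }
    }
}
```

Note that the loop in the question runs while i < number_of_pages, so its check i == number_of_pages can never be true and the last page is never copied. The sketch uses <= for that reason.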