How do I remove watermark Xobject from pdf? - java

I'd like to remove a watermark from a PDF file. It was probably created with Adobe Acrobat.
The book belongs to me. It is available to anyone who has access to an academic service called EBSCO. Many academic libraries have it, and so does mine. I downloaded the book and I want to print part of it without the annoying watermarks.
"ADBE_CompoundType" Editable watermarks (headers, footers, stamps) created by Acrobat
Information taken from here.
I used PdfContentStreamEditor class for pdfbox created by mkl and published at SO as an answer to a question. I override one method. Here it is:
@Override
protected void write(final ContentStreamWriter contentStreamWriter,
final Operator operator,
final List < COSBase > operands) throws IOException {
if (isWatermark(operator, operands)) {
final COSName xObjectName = COSName.getPDFName("Fm0");
final PDXObject fm0 = page.getResources().getXObject(xObjectName);
if (fm0 != null) {
final COSObject pieceInfo = fm0.getCOSObject()
.getCOSObject(COSName.getPDFName("PieceInfo"));
if (pieceInfo != null) {
final COSBase adbeCompoundType = pieceInfo.getDictionaryObject(
COSName.getPDFName("ADBE_CompoundType"));
if (adbeCompoundType != null) {
final COSBase privateKey = ((COSDictionary) adbeCompoundType)
.getDictionaryObject("Private");
if ("Watermark".equals(((COSName) privateKey).getName())) {
final PDResources resources = page.getResources();
resources.getCOSObject().removeItem(xObjectName);
page.getResources().getCOSObject().setNeedToBeUpdated(true);
return;
}
}
}
}
}
super.write(contentStreamWriter, operator, operands);
}
And helper method:
private boolean isWatermark(final Operator operator,
final List < COSBase > operands) {
final String operatorString = operator.getName();
return operatorString.equals("Do") &&
operands.size() == 1 && ((COSName) operands.get(0)).getName().equals("Fm0");
}
The code seems to work fine: no watermark is shown on any page. However, I cannot get rid of the object that holds the watermark. I tried to remove it with the following lines of code, but the object is not removed.
final PDResources resources = page.getResources();
resources.getCOSObject().removeItem(xObjectName);
page.getResources().getCOSObject().setNeedToBeUpdated(true);
Here's a screenshot from pdfdebugger with watermark object:
And here's the watermark text. I couldn't find out how to check whether a watermark object contains this text and I'd like to know how to do this.
And here's one page of the pdf file: link1 and link2

You try to remove the XObject Fm0 from the resources like this:
final PDResources resources = page.getResources();
resources.getCOSObject().removeItem(xObjectName);
I.e. you fetch the COS (dictionary) object of the resources and try to remove the Fm0 (in xObjectName) entry.
If you look closely at your screenshot, though, you'll see that the Fm0 entry is not directly in the Resources dictionary. Instead, Resources contains a nested XObject dictionary, and the Fm0 entry sits inside that.
Thus, the following should work:
final PDResources resources = page.getResources();
COSDictionary dict = (COSDictionary) (resources.getCOSObject().getDictionaryObject(COSName.XOBJECT));
dict.removeItem(xObjectName);
PDResources has some helper methods, so the following should also work:
page.getResources().put(xObjectName, (PDXObject)null);
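To see why the original removal had no effect, the resource structure can be modelled with plain maps. This is only a toy model that runs without PDFBox; the class name, the method names and the extra entries (Im0, Font) are hypothetical and stand in for whatever the real page resources contain.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a page's Resources dictionary; not PDFBox code.
final class NestedResourcesSketch {

    // Resources -> XObject -> Fm0, mirroring the nesting described above.
    static Map<String, Object> samplePageResources() {
        Map<String, Object> xObjects = new HashMap<>();
        xObjects.put("Fm0", "watermark form XObject");
        xObjects.put("Im0", "some page image (hypothetical)");

        Map<String, Object> resources = new HashMap<>();
        resources.put("XObject", xObjects);
        resources.put("Font", new HashMap<String, Object>());
        return resources;
    }

    // What the question's code effectively does: look for the name
    // directly in Resources. Returns true only if something was removed.
    static boolean removeTopLevel(Map<String, Object> resources, String name) {
        return resources.remove(name) != null;
    }

    // What the fix does: descend into the XObject sub-dictionary first.
    @SuppressWarnings("unchecked")
    static boolean removeNested(Map<String, Object> resources, String name) {
        Object xObjects = resources.get("XObject");
        if (!(xObjects instanceof Map)) {
            return false;
        }
        return ((Map<String, Object>) xObjects).remove(name) != null;
    }
}
```

In this model removeTopLevel finds nothing to remove because Fm0 is not a direct key of the outer map, while removeNested removes it from the inner XObject map.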
You mention that the book belongs to you and you, therefore, are entitled to remove the watermark. That is not automatically the case. Depending on the laws (global and local) and the contracts applicable you may only have acquired the right to use the book in its current form, including the watermark. Please make sure you understand the restrictions under which you may use the book.
Also I wonder why you want to get rid of that XObject if the watermark does not show anymore and you merely wanted to change the file to print without the watermark...

Although mkl has answered this question, I'd like to share a solution using the iText library, even though I prefer pdfbox over iText because the former is provided free of charge. The iText code is less verbose than the pdfbox code, because once the watermark object is removed it is automatically no longer shown on any page.
for (int i = 1; i <= document.getNumberOfPages(); i++) {
final PdfPage page = document.getPage(i);
final PdfDictionary xObject = page.getResources().getResource(PdfName.XObject);
if (xObject != null) {
final PdfStream fm0 = xObject.getAsStream(new PdfName("Fm0"));
if (fm0 != null) {
final PdfDictionary pieceInfo = fm0.getAsDictionary(new PdfName("PieceInfo"));
if (pieceInfo != null) {
final PdfDictionary adbeCompoundType = pieceInfo.getAsDictionary(
new PdfName("ADBE_CompoundType"));
if (adbeCompoundType != null) {
final PdfName privateKey = adbeCompoundType.getAsName(PdfName.Private);
if (privateKey != null) {
if ("Watermark".equals(privateKey.getValue())) {
xObject.remove(new PdfName("Fm0"));
}
}
}
}
}
}
}

Related

Remove Large Tokens from PDF using PDFBox or equivalent library

I have PDFs with extremely large tokens plastered across the entire front page of many documents (see image). I'm looking for an automated method to remove these.
Apache PDFBox has a pretty extensive API. Is there any way to match these tokens by regex, remove them, and re-save the PDF?
Image from PDF Example posted below. The tokens I'd like to remove are: [KS/2019:589] LokalvÄrd Grundskolor & Idrottshallar that are plastered on top of the regular text. Google Drive link to full PDF-file.
You can use the PdfContentStreamEditor class from this answer (don't forget to apply the fix mentioned at the bottom of the answer) like this:
try ( PDDocument document = ... ) {
PDPage page = document.getPage(0);
PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
@Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
String operatorString = operator.getName();
if (TEXT_SHOWING_OPERATORS.contains(operatorString))
{
float fs = getGraphicsState().getTextState().getFontSize();
Matrix matrix = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
Point2D.Float transformedFsVector = matrix.transformPoint(0, fs);
Point2D.Float transformedOrigin = matrix.transformPoint(0, 0);
double transformedFs = transformedFsVector.distance(transformedOrigin);
if (transformedFs > 50)
return;
}
super.write(contentStreamWriter, operator, operands);
}
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
editor.processPage(page);
document.save(...);
}
(EditPageContent test testRemoveBigTextKommersAnnonsElite)
You can find some explanations in the referenced answer.
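The size test in the code above does not compare the nominal font size from the text state. It compares the size the text ends up with on the page, after the text matrix and the current transformation matrix have been applied. A minimal stand-alone sketch of that calculation follows. It uses java.awt.geom.AffineTransform as a stand-in for PDFBox's Matrix so that it runs without PDFBox; that substitution and the class and method names are assumptions made for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Stand-alone illustration of the "effective font size" test; not PDFBox code.
final class EffectiveFontSizeSketch {

    // Maps the vector (0,0)-(0,fontSize) through the text matrix and then
    // the CTM, and returns the length of the transformed vector.
    static double effectiveFontSize(double fontSize,
                                    AffineTransform textMatrix,
                                    AffineTransform ctm) {
        // concatenate() makes textMatrix apply to a point first, then ctm.
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix);

        Point2D origin = combined.transform(new Point2D.Double(0, 0), null);
        Point2D top = combined.transform(new Point2D.Double(0, fontSize), null);
        return origin.distance(top);
    }
}
```

This is why a threshold on the transformed size catches text drawn with a font size of 1 and a heavily scaled text matrix, which a check on the nominal font size alone would miss.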

Remove links from a PDF using iText 7.1

We have a vendor that will not accept PDFs that contain links. We are trying to remove the links by removing all link annotations from each page of the PDF using iText 7.1 (Java). We have tried multiple techniques based on research. Here are three examples of attempts to detect and remove the links. None of these result in the destination PDF (test-no-links.pdf) having the links removed. Any insight would be greatly appreciated.
Example 1: Remove based on class type of annotation
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
List<PdfAnnotation> annots = pdfPage.getAnnotations();
if ((annots == null) || (annots.size() == 0)) {
System.out.println("no annotations on page " + page);
}
else {
for( PdfAnnotation annot : annots ) {
if( annot instanceof PdfLinkAnnotation ) {
pdfPage.removeAnnotation(annot);
}
}
}
}
pdfDoc.close();
Example 2: Remove based on annotation subtype value
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
List<PdfAnnotation> annots = pdfPage.getAnnotations();
if ((annots == null) || (annots.size() == 0)) {
System.out.println("no annotations on page " + page);
}
else {
for( PdfAnnotation annot : annots ) {
// if this annotation has a link, delete it
if ( annot.getSubtype().equals(PdfName.Link) ) {
PdfDictionary annotAction = ((PdfLinkAnnotation)annot).getAction();
if( annotAction.get(PdfName.S).equals(PdfName.URI) ||
annotAction.get(PdfName.S).equals(PdfName.GoToR) ) {
PdfString uri = annotAction.getAsString(PdfName.URI);
System.out.println("Removing " + uri.toString());
pdfPage.removeAnnotation(annot);
}
}
}
}
}
pdfDoc.close();
Example 3: Remove all annotations (ignore annotation type)
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
// remove all annotations from the page regardless of type
pdfPage.getPdfObject().remove(PdfName.Annots);
}
pdfDoc.close();
Each of your tests generates a PDF without Link annotations.
Probably, though, your PDF viewer recognizes "www.qualpay.com" as a (partial) URL and displays it as a link.
In detail
Your routines
All your tests successfully remove all Link annotations from your sample PDF, cf. these screen shots for the source and all three result files, in particular look for the page 1 Annots entry:
test-with-links.pdf
test-no-links.pdf
test-no-links-1.pdf
test-no-links-2.pdf
The viewer
Indeed, though, when viewing the PDF in Adobe Acrobat Reader (and also some other viewers, e.g. the built-in PDF viewers of Chrome and Edge), you'll see that "www.qualpay.com" is treated like a link.
The cause is that this is a feature of the PDF viewer! It scans the text of the PDF it displays for strings it recognizes as (a part of) some URL and displays them like links!
In Adobe Acrobat Reader you can disable this feature:
If you disable "Create links from URLs", you'll suddenly find the URLs in your result files inactive while the URL in your source file (with the link annotation) is still active.
What to do
We have a vendor that will not accept PDFs that contain links.
First discuss with your vendor what exactly he means by "PDFs that contain links". Does he mean
PDFs with Link annotations or
PDFs with URLs that common PDF viewers present like Link annotations.
In the former case you're done, your code (either variant) removes the link annotations. You may have to demonstrate to the vendor how to disable the URL recognition in Adobe Acrobat Reader, though.
In the latter case you'll have to remove everything from the text content of your PDFs that common PDF viewers recognize as URLs. You may replace each URL by a bitmap image of the URL text, or the URL text drawn like a generic vector graphic (defining a path of lines and curves and filling that), or some similar surrogate.
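For the latter case, the first step would be to find the strings a viewer might turn into links. The sketch below is a rough approximation only: the exact rules viewers use are not known here, so the regular expression is an assumption, and the class and method names are hypothetical. It works on plain text that has already been extracted from a page; the extraction itself is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough approximation of "text a viewer may auto-link"; not a viewer's real rule set.
final class UrlLikeTextFinder {

    // Anything starting with http://, https:// or www. up to the next
    // whitespace, angle bracket or double quote.
    private static final Pattern URL_LIKE = Pattern.compile(
            "\\b(?:https?://|www\\.)[^\\s<>\"]+", Pattern.CASE_INSENSITIVE);

    // Returns every URL-like substring of the given page text, in order.
    static List<String> findUrlLikeStrings(String pageText) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = URL_LIKE.matcher(pageText);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }
}
```

A page whose text yields an empty list here may still be auto-linked by some viewer, so the result is best treated as a hint about which text runs to replace, not as proof that none remain.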

How to recognize programmatically (in Java) whether a PDF is normal (searchable) or scanned (image)?

I am working on PDF-to-Excel conversion using docparser.
But docparser is unable to process scanned PDFs properly, so I need to separate the scanned PDFs from the normal ones and send only the normal PDFs through docparser (i.e. the API call).
Is there some way to identify the PDF type (scanned or normal) programmatically so that I can proceed?
Please help if anyone knows how to tackle this problem.
Finally, I found a solution to my question, though I think it is not a standard one. Thanks to the people who commented and provided help.
Using the PDFBox library, we can go through the pages of the PDF and check whether each XObject in a page's resources is an instance of an image object (PDImageXObject). Every match is counted as an image. If the number of images equals the number of pages in the PDF, we say it is a scanned PDF.
Here is the code:
public static String testPdf(String filename) throws IOException
{
String s = "";
int g = 0;
int gg = 0;
PDDocument doc = PDDocument.load(new File(filename));
gg = doc.getNumberOfPages();
for(PDPage page:doc.getPages())
{
PDResources resource = page.getResources();
for(COSName xObjectName:resource.getXObjectNames())
{
PDXObject xObject = resource.getXObject(xObjectName);
if (xObject instanceof PDImageXObject)
{
((PDImageXObject) xObject).getImage();
g++;
}
}
}
doc.close();
if(g==gg) // pdf pages if equal to the images
{
return "Scanned pdf";
}
else
{
return "Searchable pdf";
}
}
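The rule above compares one total against the page count, so it can be fooled when images are unevenly distributed across pages. The sketch below isolates the decision from PDFBox and puts the original rule next to a per-page variant. The per-page variant is a suggestion added here, not part of the answer's method, and both remain heuristics; the class and method names are hypothetical, and the per-page image counts are assumed to have been collected by a loop like the one above.

```java
// Decision logic only; counting images per page is assumed to happen elsewhere.
final class ScannedPdfHeuristic {

    // Rule from the code above: total number of images equals number of pages.
    static boolean looksScannedByTotals(int[] imagesPerPage) {
        int total = 0;
        for (int count : imagesPerPage) {
            total += count;
        }
        return total == imagesPerPage.length;
    }

    // Suggested variant: every page must carry at least one image.
    // A document without pages is not treated as scanned.
    static boolean looksScannedPerPage(int[] imagesPerPage) {
        if (imagesPerPage.length == 0) {
            return false;
        }
        for (int count : imagesPerPage) {
            if (count == 0) {
                return false;
            }
        }
        return true;
    }
}
```

The two rules disagree for a two-page file with counts {2, 0} (totals match, but one page has no image) and for counts {1, 2} (every page has an image, but the totals do not match).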

finding out required fields to fill in pdf file

I'm starting on new functionality in my Android application which will help to fill in certain PDF forms.
I found out that the best solution will be to use the iText library.
I can read the file and read the AcroFields from the document, but is there any way to find out whether a specific field is marked as required?
I tried to find this option in the API documentation and on the Internet, but there was nothing that helped to solve this issue.
Please take a look at section 13.3.4 of my book, entitled "AcroForms revisited". Listing 13.15 shows a code snippet from the InspectForm example that checks whether or not a field is a password or a multi-line field.
With some minor changes, you can adapt that example to check for required fields:
for (Map.Entry<String,AcroFields.Item> entry : fields.entrySet()) {
out.write(entry.getKey());
item = entry.getValue();
dict = item.getMerged(0);
flags = dict.getAsNumber(PdfName.FF);
if (flags != null && (flags.intValue() & BaseField.REQUIRED) > 0)
out.write(" -> required\n");
}
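The check in that loop is a plain bit test on the field's Ff value. The sketch below shows the same test without iText. The constants are written out by hand from the field flag table of the PDF specification (ReadOnly is bit position 1, Required is bit position 2, NoExport is bit position 3) and are not taken from iText; the class and method names are hypothetical.

```java
// Stand-alone illustration of testing the "required" bit of a field's Ff value.
final class FieldFlagsSketch {

    static final int READ_ONLY = 1;      // bit position 1
    static final int REQUIRED = 1 << 1;  // bit position 2
    static final int NO_EXPORT = 1 << 2; // bit position 3

    // ff is the numeric Ff value, or null when the field has no Ff entry.
    static boolean isRequired(Integer ff) {
        return ff != null && (ff & REQUIRED) != 0;
    }
}
```

Because other flags may be set at the same time, the value has to be masked with the flag instead of being compared for equality.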
To check a single required field without looping in iText 5:
String src = "yourpath/pdf1.pdf";
String dest = "yourpath/pdf2.pdf";
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
AcroFields form = stamper.getAcroFields();
Map<String,AcroFields.Item> fields = form.getFields();
AcroFields.Item item;
PdfDictionary dict;
PdfNumber flags;
item=fields.get("fieldName");
dict = item.getMerged(0);
flags = dict.getAsNumber(PdfName.FF);
if (flags != null && (flags.intValue() & BaseField.REQUIRED) > 0)
{
System.out.println("flag has set");
}

iText attaching file to existing PDF/A-3 results in PdfAConformanceException

I am trying to attach a file to an existing PDF/A-3.
This example explains how to create a PDF/A-3 and attach content to it.
My next step was to adapt the code and use the PdfAStamper instead of the document.
So here is my resulting code.
private ByteArrayOutputStream append(byte[] content, InputStream inPdf) throws IOException, DocumentException {
ByteArrayOutputStream result = new ByteArrayOutputStream(16000);
PdfReader reader = new PdfReader(inPdf);
PdfAStamper stamper = new PdfAStamper(reader, result, PdfAConformanceLevel.PDF_A_3B);
stamper.createXmpMetadata();
// Creating PDF/A-3 compliant attachment.
PdfDictionary embeddedFileParams = new PdfDictionary();
embeddedFileParams.put(PARAMS, new PdfName(ZF_NAME));
embeddedFileParams.put(MODDATE, new PdfDate());
PdfFileSpecification fs = PdfFileSpecification.fileEmbedded(stamper.getWriter(), null,ZF_NAME, content , "text/xml", embeddedFileParams,0);
fs.put(AFRELATIONSHIP, Alternative);
stamper.addFileAttachment("file description",fs);
stamper.close();
reader.close();
return result;
}
Here is the Stacktrace of the error.
com.itextpdf.text.pdf.PdfAConformanceException: EF key of the file specification dictionary for an embedded file shall contain dictionary with valid F key.
at com.itextpdf.text.pdf.internal.PdfA3Checker.checkFileSpec(PdfA3Checker.java:95)
at com.itextpdf.text.pdf.internal.PdfAChecker.checkPdfAConformance(PdfAChecker.java:198)
at com.itextpdf.text.pdf.internal.PdfAConformanceImp.checkPdfIsoConformance(PdfAConformanceImp.java:70)
at com.itextpdf.text.pdf.PdfWriter.checkPdfIsoConformance(PdfWriter.java:3380)
at com.itextpdf.text.pdf.PdfWriter.checkPdfIsoConformance(PdfWriter.java:3376)
at com.itextpdf.text.pdf.PdfFileSpecification.toPdf(PdfFileSpecification.java:309)
at com.itextpdf.text.pdf.PdfIndirectObject.writeTo(PdfIndirectObject.java:157)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.write(PdfWriter.java:424)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:402)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:381)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:334)
at com.itextpdf.text.pdf.PdfWriter.addToBody(PdfWriter.java:819)
at com.itextpdf.text.pdf.PdfFileSpecification.getReference(PdfFileSpecification.java:256)
at com.itextpdf.text.pdf.PdfDocument.addFileAttachment(PdfDocument.java:2253)
at com.itextpdf.text.pdf.PdfWriter.addFileAttachment(PdfWriter.java:1714)
at com.itextpdf.text.pdf.PdfStamper.addFileAttachment(PdfStamper.java:497)
Now when I try to analyze the stack trace and take a look at PdfFileSpecification.fileEmbedded,
I see that an EF is created with an F and a UF entry.
Looking inside PdfA3Checker, I see that in the line PdfDictionary embeddedFile = getDirectDictionary(dict.get(PdfName.F)); the F value is not a dictionary but a stream.
if (fileSpec.contains(PdfName.EF)) {
PdfDictionary dict = getDirectDictionary(fileSpec.get(PdfName.EF));
if (dict == null || !dict.contains(PdfName.F)) {
throw new PdfAConformanceException(obj1, MessageLocalization.getComposedMessage("ef.key.of.file.specification.dictionary.shall.contain.dictionary.with.valid.f.key"));
}
PdfDictionary embeddedFile = getDirectDictionary(dict.get(PdfName.F));
if (embeddedFile == null) {
throw new PdfAConformanceException(obj1, MessageLocalization.getComposedMessage("ef.key.of.file.specification.dictionary.shall.contain.dictionary.with.valid.f.key"));
}
checkEmbeddedFile(embeddedFile);
}
Is this a bug in iText or am I missing something? By the way I am using iText 5.4.5.
Update 1
As suggested by Bruno and mkl, the 4.5.6-Snapshot should contain the fix. I ran my test case (Gist link to full test case) against the current trunk, but the result was the same error.
You ran into a bug very similar to the one discussed in Creating PDF/A-3: Embedded file shall contain valid Params key:
The problem (as you found out) is in this code
PdfDictionary embeddedFile = getDirectDictionary(dict.get(PdfName.F));
if (embeddedFile == null) {
throw new PdfAConformanceException(obj1, MessageLocalization.getComposedMessage("ef.key.of.file.specification.dictionary.shall.contain.dictionary.with.valid.f.key"));
}
in PdfA3Checker.checkFileSpec(PdfWriter, int, Object); even though dict contains a stream named F, getDirectDictionary(dict.get(PdfName.F)) does not return it. The reason is not, though, that a dictionary is sought here (a stream essentially is a dictionary with some additions), but it is an issue in PdfAChecker.getDirectObject which is called by PdfAChecker.getDirectDictionary:
protected PdfObject getDirectObject(PdfObject obj) {
if (obj == null)
return null;
//use counter to prevent indirect reference cycling
int count = 0;
while (obj.type() == 0) {
PdfObject tmp = cachedObjects.get(new RefKey((PdfIndirectReference)obj));
if (tmp == null)
break;
obj = tmp;
//10 - is max allowed reference chain
if (count++ > 10)
break;
}
return obj;
}
This method only looks for cached objects (i.e. in cachedObjects), but in your case (and in a test of mine, too) this stream has already been written to file and is not in the cache anymore, resulting in null being returned... correction, cf. the PPS: it has been written, but it was never cached to begin with.
PS: PDF/A-3 conforming file attachments work if added during PDF creation (using a PdfAWriter), but not if added during PDF manipulation (using a PdfAStamper); maybe the caching is different in these use cases.
PPS: Indeed there is a difference: PdfAWriter overrides the addToBody overloads by adding the added objects to a cache. PdfAStamperImp does not do so and furthermore is derived from PdfStamperImp and PdfWriter, not from PdfAWriter.
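The effect of the missing cache entry can be reproduced in a small model. The sketch below is not iText code: the class, the Ref type and the use of a plain Map as a "dictionary" are hypothetical stand-ins, and it assumes that the dictionary lookup returns null for anything that is not a dictionary, which matches the behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of resolving indirect references through a cache only.
final class CacheLookupSketch {

    // Stand-in for an indirect reference, identified by object number.
    static final class Ref {
        final int number;

        Ref(int number) {
            this.number = number;
        }
    }

    private final Map<Integer, Object> cachedObjects = new HashMap<>();

    // Models a writer that caches objects as it adds them to the body.
    void cache(int number, Object value) {
        cachedObjects.put(number, value);
    }

    // Same shape as the method quoted above: follow references while the
    // cache knows them, and give up on the first cache miss.
    Object getDirectObject(Object obj) {
        int count = 0;
        while (obj instanceof Ref) {
            Object tmp = cachedObjects.get(((Ref) obj).number);
            if (tmp == null) {
                break; // cache miss: the unresolved reference is returned
            }
            obj = tmp;
            if (count++ > 10) {
                break; // guard against reference cycles
            }
        }
        return obj;
    }

    // Returns the resolved dictionary, or null if resolution did not end in one.
    Map<?, ?> getDirectDictionary(Object obj) {
        Object direct = getDirectObject(obj);
        return direct instanceof Map ? (Map<?, ?>) direct : null;
    }
}
```

With nothing cached, getDirectDictionary(new Ref(7)) yields null even though object 7 may well exist in the output, which corresponds to the stamper case. Once the object has been cached under that number, the same call returns it, which corresponds to the writer case.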
You've indeed hit a bug in iText 5.4.5. This bug was reported here: Creating PDF/A-3: Embedded file shall contain valid Params key
It was fixed in the SVN version of iText. We're preparing the next version of iText (due this week).
