PDFBox extracting blanks from PDF encrypted with no password

I'm using PDFBox to extract text from forms, and I have a PDF that is not protected by a password but that PDFBox reports as encrypted. I suspect some sort of Adobe "feature": when I open the file it says (SECURED), while other PDFs that I have no issues with do not. isEncrypted() returns true, so despite not having a password it appears to be secured somehow.
I suspect that it is not decrypting properly, as it is able to pull the form's text prompts but not the responses themselves. In the code below it pulls Address (Street Name and Number) and City from the sample PDF, but not the response in between them.
I am using PDFBox 2.0, but I have also tried 1.8.
I've tried every method of decrypting that I can find for PDFBox, including the deprecated ones (why not). I get the same result as not trying to decrypt at all, just the Address and City prompts.
With PDFs being the absolute nightmare that they are, this PDF was likely created in some non-standard way. Any help in identifying this and getting moving again is appreciated.
Sample PDF
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDPage;
import java.awt.Rectangle;
import java.util.List;
class Scratch {
    private static float pwidth;
    private static float pheight;
    private static int widthByPercent(double percent) {
        return (int)Math.round(percent * pwidth);
    }
    private static int heightByPercent(double percent) {
        return (int)Math.round(percent * pheight);
    }
    public static void main(String[] args) {
        try {
            //Create objects
            File inputStream = new File("ocr/TestDataFiles/i-9_08-07-09.pdf");
            PDDocument document = PDDocument.load(inputStream);
            // Try every decryption method I've found
            if(document.isEncrypted()) {
                // Method 1
                document.decrypt("");
                // Method 2
                document.openProtection(new StandardDecryptionMaterial(""));
                // Method 3
                document.setAllSecurityToBeRemoved(true);
                System.out.println("Removed encryption");
            }
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            //Get the page with data on it
            PDPageTree allPages = document.getDocumentCatalog().getPages();
            PDPage page = allPages.get(3);
            pheight = page.getMediaBox().getHeight();
            pwidth = page.getMediaBox().getWidth();
            Rectangle LastName = new Rectangle(widthByPercent(0.02), heightByPercent(0.195), widthByPercent(0.27), heightByPercent(0.1));
            stripper.addRegion("LastName", LastName);
            stripper.setSortByPosition(true);
            stripper.extractRegions(page);
            List<String> regions = stripper.getRegions();
            System.out.println(stripper.getTextForRegion("LastName"));
        } catch (Exception e){
            System.out.println(e.getMessage());
        }
    }
}

Bruno's comment explains why the PDF is encrypted even though you do not need to enter a password:
A PDF can be encrypted with two passwords: a user password and an owner password. When a PDF is encrypted with a user password, you can't open the document in a PDF viewer without entering that password. When a PDF is encrypted with an owner password only, everyone can open a PDF without that password, but some restrictions may be in place. You can recognize PDFs encrypted with an owner password because they mention "SECURED" in Adobe Reader.
Your PDF is encrypted using only an owner password, i.e. the user password is empty. Thus, you can decrypt it using the empty password "" like this in your PDFBox version:
document.decrypt("");
(This "method 1", by the way, is exactly the same as your "method 2"
document.openProtection(new StandardDecryptionMaterial(""));
plus some exception wrapping.)
Tilman's comment implies the reason why you don't retrieve the form values: Your code uses the PDFTextStripperByArea to do text extraction, but this text extraction only extracts the fixed page content, not the content of the annotations floating on that page.
The content you want to extract is the content of form fields whose widgets are annotations.
Tilman's proposal
doc.getDocumentCatalog().getAcroForm().getField("form1[0].#subform[3].address[0]").getValueAsString()
shows how to extract the value of a form field you know the name of, "form1[0].#subform[3].address[0]" in this case. If you don't know the name of the field you want to extract content from, the PDAcroForm object returned by doc.getDocumentCatalog().getAcroForm() has a number of other methods to access field contents.
By the way, a field name like "form1[0].#subform[3].address[0]" in the AcroForm definition indicates yet another specialty of your PDF: It actually contains two form definitions, the core PDF AcroForm definition and the more independent XFA definition. Both describe the same visual form. Such a PDF form is called a hybrid PDF form.
The advantage of hybrid forms is that they can be viewed and filled in using PDF tools which only know AcroForm forms (which is essentially all software except Adobe's) while PDF tools with XFA support (essentially only Adobe's software) can make use of additional XFA features.
The drawback of hybrid forms is that if they are filled in using a tool without XFA support, only the AcroForm information is updated while the XFA information remains as before. Thus, the hybrid document can contain different data for the same field...
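To make the shape of such a field name concrete, here is a minimal sketch that splits a name like "form1[0].#subform[3].address[0]" into its name/index segments. It uses only the Java standard library. The class XfaStyleFieldName and its methods are made up for illustration and are not part of PDFBox, and treating "every segment ends in [n]" as a sign of an XFA-generated name is only a heuristic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): splits a fully qualified field
// name into segments of the form name[index].
class XfaStyleFieldName {

    // One segment such as "#subform[3]": a name followed by a numeric index.
    private static final Pattern SEGMENT =
            Pattern.compile("([^.\\[\\]]+)\\[(\\d{1,9})\\]");

    static final class Segment {
        final String name;
        final int index;

        Segment(String name, int index) {
            this.name = name;
            this.index = index;
        }
    }

    // Returns the segments, or an empty list if any part of the name
    // does not follow the name[index] pattern.
    static List<Segment> parse(String fullyQualifiedName) {
        List<Segment> segments = new ArrayList<>();
        for (String part : fullyQualifiedName.split("\\.")) {
            Matcher m = SEGMENT.matcher(part);
            if (!m.matches()) {
                return new ArrayList<>();
            }
            segments.add(new Segment(m.group(1), Integer.parseInt(m.group(2))));
        }
        return segments;
    }

    // Heuristic only: true when every segment carries an index.
    static boolean looksXfaGenerated(String fullyQualifiedName) {
        return !parse(fullyQualifiedName).isEmpty();
    }
}
```

For the name above, parse returns three segments (form1, #subform and address), while a plain AcroForm name such as "Given Name Text Box" yields an empty list.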

Related

How to fill a secured PDF from spreadsheet data [Java]

I have looked into two libraries for doing this to no success. I am not the most experienced.
PDFBox - I think because it is a secured pdf the PDDocument class was unable to load the fields to fill.
Adobe FDFToolkit - I couldn't get the fields from the file because it was a PDF not an FDF. Not sure how to convert.
iText - org/bouncycastle/asn1/ASN1OctetString error while opening the PDF
I am having trouble getting any of these to work due to the nature of the file. It is a government immigration form which can be found here: https://www.uscis.gov/sites/default/files/files/form/i-589.pdf. Any ideas for working around this?
Your form is encrypted using an owner password. The permissions are set in such a way that they allow form filling, but neither iText nor PDFBox is currently fine-grained enough to check those permissions: if a PDF is encrypted, you are asked to provide a password.
However, with iText, there is a setting called unethicalreading. See How to decrypt a PDF document with the owner password? in the official documentation:
PdfReader.unethicalreading = true;
By setting this static variable to true, the PDF will be treated as if it weren't encrypted.
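For background on the permissions mentioned above: in the standard security handler they are stored as a bit field in the /P entry of the encryption dictionary. The sketch below decodes such a value using the bit positions from the PDF specification. It uses only the Java standard library, the class PdfPermissionBits is made up for illustration, and obtaining the actual /P value from a file is left to whichever PDF library you use.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper (not an iText or PDFBox class) that decodes the
// permission bit field of the standard security handler.
class PdfPermissionBits {

    // Bit positions are 1-based, counting from the lowest-order bit.
    enum Permission {
        PRINT(3),
        MODIFY_CONTENTS(4),
        COPY(5),
        MODIFY_ANNOTATIONS(6),      // also allows filling in form fields
        FILL_FORMS(9),              // fill in existing fields even if bit 6 is clear
        EXTRACT_FOR_ACCESSIBILITY(10),
        ASSEMBLE(11),
        PRINT_HIGH_QUALITY(12);

        private final int bit;

        Permission(int bit) {
            this.bit = bit;
        }

        int mask() {
            return 1 << (bit - 1);
        }
    }

    // Returns every permission whose bit is set in p.
    static Set<Permission> decode(int p) {
        Set<Permission> granted = EnumSet.noneOf(Permission.class);
        for (Permission permission : Permission.values()) {
            if ((p & permission.mask()) != 0) {
                granted.add(permission);
            }
        }
        return granted;
    }

    // Form filling is allowed when either bit 6 or bit 9 is set.
    static boolean canFillForms(int p) {
        return (p & (Permission.MODIFY_ANNOTATIONS.mask() | Permission.FILL_FORMS.mask())) != 0;
    }
}
```

As a usage example with a hypothetical value, `0xFFFFF0C0 | Permission.PRINT.mask() | Permission.FILL_FORMS.mask()` (reserved bits set, plus printing and form filling) decodes to PRINT and FILL_FORMS, and canFillForms returns true for it.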

How to create a PDF/DOCX files in Java/Scala?

I am creating a web application which will accept some inputs from user (like name, age, address etc) and generate some predefined forms with filled information for user to download and print.
For example, an Application Form for driving license or something along those lines. The backend will have the format information about the document to be generated and other information will be gathered from user from front-end.
I am going to use Play Framework 2.5 for this and Java/Scala as programming language. But right now I am not aware if there are any free libraries/APIs that I can use to achieve this document generation.
I should be able to manipulate the font size, style, indentations, paragraphs, page borders, page numbers, alignments, document headers and footers, page size (A4, Legal etc) some other basic stuff. And I need documents in format that are widely supported for editing and printing purposes. Like PDF, DOCX for example. DOCX is preferred so user can edit something after downloading the document before taking a print out.
I have used the Apache POI library to parse and create MS Word documents (including .docx files):
http://www.tutorialspoint.com/apache_poi_word/apache_poi_word_quick_guide.htm
It's not amazing but it's the best I've found :)
I have used docx4j.jar to convert XHTML to DOCX.
For your requirement, you can save your format information as an XHTML template and insert the form input (name, age, address, etc.) into the template at runtime.
Here is some sample code, taken from this link:
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public static void main(String[] args) throws Exception
{
    // XHTML to convert; in practice this would be the filled-in template
    String xhtml=
        "<table border=\"1\" cellpadding=\"1\" cellspacing=\"1\" style=\"width:100%;\"><tbody><tr><td>test</td><td>test</td></tr><tr><td>test</td><td>test</td></tr><tr><td>test</td><td>test</td></tr></tbody></table>";
    // To docx, with content controls
    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
    XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
    // Convert the XHTML and append the result to the main document part
    wordMLPackage.getMainDocumentPart().getContent().addAll(
        XHTMLImporter.convert( xhtml, null) );
    wordMLPackage.save(new java.io.File("D://sample.docx"));
}
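The sample above converts a fixed XHTML string; the templating step itself is not shown. A minimal sketch of that step follows, assuming placeholders written as ${name} (a convention chosen here for illustration, not something docx4j defines). The class XhtmlTemplate is hypothetical and uses only the Java standard library. User input is escaped so that characters such as & or < cannot break the XHTML.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fills ${placeholder} markers in an XHTML template.
class XhtmlTemplate {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Escapes the characters that are significant in XHTML text and attributes.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    // Replaces every ${key} with the escaped value for that key.
    // Throws if the template references a key that has no value.
    static String fill(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String key = matcher.group(1);
            String value = values.get(key);
            if (value == null) {
                throw new IllegalArgumentException("No value for placeholder: " + key);
            }
            matcher.appendReplacement(result, Matcher.quoteReplacement(escape(value)));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

The string returned by fill would then be passed to the importer in place of the hard-coded xhtml variable.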

How to distinguish between two encrypted / secured PDF files

I have two secured pdf files. One has a password and the other one is secured but without password. I am using PDF Box.
How can I identify which file has password and which one is secured but without password?
PDFs support two kinds of password:
Owner password - set by the PDF owner / creator to restrict usage (e.g. editing, printing, copying)
User password - required to open / view the PDF
A PDF can have only an owner password, or both, but not only a user password. In either case the PDF is considered encrypted, and there is no direct API to distinguish between the two kinds of encryption.
With PDFBox you can use the code snippet below to determine whether a PDF is encrypted, and whether it has only an owner password or both.
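PDDocument pdfDoc = PDDocument.load(new File("path/to/pdf"));
boolean hasOwnerPwd = false;
boolean hasUserPwd = false;
if (pdfDoc.isEncrypted()) {
    // An encrypted PDF always has an owner password
    hasOwnerPwd = true;
    try {
        // Try to open the document without supplying a password
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial(null);
        pdfDoc.openProtection(sdm);
        // Success: the user password is empty, only an owner password is set
    } catch (Exception e) {
        // Opening without a password failed, so a user password is required
        hasUserPwd = true;
    }
}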
See PDFBox API docs here and here.
EDIT: Thanks to Tilman for pointing out the latest code and an alternate way to distinguish between the two kinds of encryption. The code snippet and post have been updated accordingly.
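The decision itself reduces to two observations: whether the document is encrypted, and whether it opens with the empty user password. Here is a minimal sketch of that decision table, independent of any PDF library. The class and enum names are made up for illustration, and the two booleans are assumed to come from calls such as isEncrypted() and an attempt to open the document without a password.

```java
// Hypothetical helper: classifies a PDF from two observations made
// with whatever PDF library is in use.
class EncryptionKind {

    enum Kind {
        NOT_ENCRYPTED,
        OWNER_PASSWORD_ONLY,
        USER_PASSWORD_REQUIRED
    }

    // encrypted: the library reports the document as encrypted.
    // opensWithEmptyPassword: decryption with the empty password succeeded.
    static Kind classify(boolean encrypted, boolean opensWithEmptyPassword) {
        if (!encrypted) {
            return Kind.NOT_ENCRYPTED;
        }
        // A document that opens with the empty password has no user
        // password, so its protection rests on the owner password alone.
        return opensWithEmptyPassword
                ? Kind.OWNER_PASSWORD_ONLY
                : Kind.USER_PASSWORD_REQUIRED;
    }
}
```

Keeping the classification separate from the library calls also makes it easy to unit test without sample PDF files.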

How to automate PDF form-filling in Java

I am doing some "pro bono" development for a food pantry near where I live. They are inundated with forms and paperwork, and I would like to develop a system that simply reads data from their MySQL server (which I set up for them on a previous project) and feeds data into PDF versions of all the forms they are required to fill out. This will help them out enormously and save them a lot of time, as well as get rid of a lot of human errors that are made when filling out these forms.
Not knowing anything about the internals of PDF files, I can foresee two avenues here:
Harder Way: It is possible to scan a paper document, turn it into a PDF, and then have software that "fills out" the PDF simply by saying "add text excerpt blah to the following (x,y) coordinates..."; or
Easier Way: PDF specification already allows for the construct of "fields" that can be filled out; this way I just write code that says "add text excerpt blah to the field called *address_value*...", etc.
So my first question is: which of the two avenues am I facing? Does PDF have a concept of "fields" or do I need to "fill out" these documents by telling the PDF library the pixel coordinates of where to place data?
Second, I obviously need an open source (and Java) library to do this. iText seems to be a good start but I've heard it can be difficult to work with. Can anyone lend some ideas or general recommendations here? Thanks in advance!
You can easily merge data into a PDF's fields using FDF (Forms Data Format) technology.
Adobe provides a library to do that : Acrobat Forms Data Format (FDF) Toolkit
Also Apache PDFBox can be used to do that.
Please take a look at the chapter about interactive forms in the free ebook The Best iText Questions on StackOverflow. It bundles the answers to questions such as:
How to fill out a pdf file programatically?
How can I flatten a XFA PDF Form using iTextSharp?
Checking off pdf checkbox with itextsharp
How to continue field output on a second page?
finding out required fields to fill in pdf file
and so on...
Or you can watch this video where I explain how to use forms for reporting step by step.
See for instance:
public void manipulatePdf(String src, String dest) throws DocumentException, IOException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader,
        new FileOutputStream(dest));
    AcroFields fields = stamper.getAcroFields();
    fields.setField("name", "CALIFORNIA");
    fields.setField("abbr", "CA");
    fields.setField("capital", "Sacramento");
    fields.setField("city", "Los Angeles");
    fields.setField("population", "36,961,664");
    fields.setField("surface", "163,707");
    fields.setField("timezone1", "PT (UTC-8)");
    fields.setField("timezone2", "-");
    fields.setField("dst", "YES");
    stamper.setFormFlattening(true);
    stamper.close();
    reader.close();
}
// Fill AcroForm fields by name using PDFBox
public void fillPDF()
{
    try {
        PDDocument pDDocument = PDDocument.load(new File("D:/pdf/pdfform.pdf")); // pdfform.pdf is input file
        PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
        // Text fields
        PDField field = pDAcroForm.getField("Given Name Text Box");
        field.setValue("firstname");
        field = pDAcroForm.getField("Family Name Text Box");
        field.setValue("lastname");
        // Combo box
        field = pDAcroForm.getField("Country Combo Box");
        field.setValue("Country");
        System.out.println("country combo" );
        // These two fields are only looked up, not filled in
        field = pDAcroForm.getField(" Driving License Check Box");
        field = pDAcroForm.getField("Favourite Colour List Box");
        System.out.println("favourite colour required: " + field.isRequired());
        pDDocument.save("D:/pdf/pdf-java-output.pdf");
        pDDocument.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Securing PDF Generated from iTextPdf

I have written software that generates a PDF as part of its function, using the iTextPDF Java library. For a demo version of the software, I added a text watermark (like "demo software") with the following code:
PdfContentByte under = writer.getDirectContentUnder();
BaseFont baseFont = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.EMBEDDED);
under.beginText();
under.setColorFill(BaseColor.RED);
under.setFontAndSize(baseFont, 25);
under.showTextAligned(PdfContentByte.ALIGN_CENTER," demo software",250, 470,55);
under.endText();
I then converted the PDF to .docx format using a PDF-to-Word converter. The resulting .docx file does not contain the watermark, and its contents are easily editable, which defeats the purpose of shipping a demo version.
How can I achieve permanent watermarking so that a PDF-to-Word converter won't be able to remove it?
One idea that comes to mind: instead of putting the text in the PDF, convert all the text of a page into an image first and build the PDF from those images. But I am unsure how to achieve this using iTextPdf.
You can encrypt your PDF so that it cannot be modified without an owner password. After you have generated your PDF, create a PdfStamper with your PDF as input
and encrypt it as follows:
final PdfReader reader = new PdfReader(your_input_stream);
final PdfStamper stamper = new PdfStamper(reader, your_output_stream);
stamper.setEncryption(PdfWriter.ENCRYPTION_AES_128 | PdfWriter.DO_NOT_ENCRYPT_METADATA,
        "your_user_password", "your_owner_password", PdfWriter.ALLOW_PRINTING);
stamper.close();
As a side note, I would recommend not using a hardcoded owner password. Since you have no need for the owner password after the file has been generated, I would suggest making it a SHA hash of a random string of, say, 20 alphanumeric characters.
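A minimal sketch of that suggestion, using only the Java standard library. The class name RandomOwnerPassword is made up for illustration, and SHA-256 is an assumption here since the text above only says "a SHA hash".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical helper: produces a throwaway owner password.
class RandomOwnerPassword {

    private static final String ALPHANUMERIC =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    // Builds a random alphanumeric string of the given length.
    static String randomAlphanumeric(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHANUMERIC.charAt(RANDOM.nextInt(ALPHANUMERIC.length())));
        }
        return sb.toString();
    }

    // Returns the SHA-256 hash, as lowercase hex, of 20 random characters.
    static String generate() {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(
                    randomAlphanumeric(20).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The result of RandomOwnerPassword.generate() would be passed to setEncryption in place of "your_owner_password". Note that the standard security handler used with 128-bit encryption only considers the first 32 bytes of a password, so only the first half of the 64-character hex string takes effect.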
