finding out required fields to fill in pdf file - java

I'm starting on new functionality in my Android application that will help users fill in certain PDF forms.
I found that the best solution is to use the iText library.
I can read the file and read the AcroFields from the document, but is there any way to find out whether a specific field is marked as required?
I tried to find this option in the API documentation and on the Internet, but found nothing that helps solve this issue.

Please take a look at section 13.3.4 of my book, entitled "AcroForms revisited". Listing 13.15 shows a code snippet from the InspectForm example that checks whether or not a field is a password or a multi-line field.
With some minor changes, you can adapt that example to check for required fields:
for (Map.Entry<String,AcroFields.Item> entry : fields.entrySet()) {
    out.write(entry.getKey());
    item = entry.getValue();
    dict = item.getMerged(0);
    flags = dict.getAsNumber(PdfName.FF);
    if (flags != null && (flags.intValue() & BaseField.REQUIRED) > 0)
        out.write(" -> required\n");
}
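The check above works because the /Ff entry of a field dictionary is a bit field: in the PDF specification, ReadOnly is bit position 1, Required is bit position 2 and NoExport is bit position 3, and iText's BaseField.REQUIRED carries the value 2. The following is a minimal, library-free sketch of that bit test. The class name FieldFlags and its constants are hypothetical stand-ins for illustration, not iText API.

```java
class FieldFlags {
    // Common field flag values (bit positions 1 to 3 of /Ff).
    static final int READ_ONLY = 1;       // bit position 1
    static final int REQUIRED  = 1 << 1;  // bit position 2, value 2
    static final int NO_EXPORT = 1 << 2;  // bit position 3, value 4

    // A missing /Ff entry (null) means no flag is set.
    static boolean isRequired(Integer ff) {
        return ff != null && (ff & REQUIRED) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isRequired(null));                 // no /Ff entry
        System.out.println(isRequired(READ_ONLY));            // only ReadOnly set
        System.out.println(isRequired(READ_ONLY | REQUIRED)); // Required set
    }
}
```

Comparing the masked value with `!= 0` instead of `> 0` gives the same result for the Required bit and also stays correct for a flag stored in the sign bit.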

To check a single field for the required flag without looping (iText 5):
String src = "yourpath/pdf1.pdf";
String dest = "yourpath/pdf2.pdf";
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
AcroFields form = stamper.getAcroFields();
Map<String,AcroFields.Item> fields = form.getFields();
// fields.get returns null if the document has no field with this name
AcroFields.Item item = fields.get("fieldName");
if (item != null) {
    PdfDictionary dict = item.getMerged(0);
    PdfNumber flags = dict.getAsNumber(PdfName.FF);
    if (flags != null && (flags.intValue() & BaseField.REQUIRED) > 0) {
        System.out.println("required flag is set");
    }
}
stamper.close();
reader.close();


How do I remove watermark Xobject from pdf?

I'd like to remove a watermark from a PDF file. It was probably created with Acrobat.
The book belongs to me. It is available to anyone who has access to an academic service called EBSCO. Many academic libraries have it, including mine. I downloaded the book and want to print part of it without the annoying watermarks.
"ADBE_CompoundType" Editable watermarks (headers, footers, stamps) created by Acrobat
Information taken from here.
I used the PdfContentStreamEditor class for PDFBox, created by mkl and published on SO as an answer to a question. I overrode one method. Here it is:
@Override
protected void write(final ContentStreamWriter contentStreamWriter,
                     final Operator operator,
                     final List<COSBase> operands) throws IOException {
    if (isWatermark(operator, operands)) {
        final COSName xObjectName = COSName.getPDFName("Fm0");
        final PDXObject fm0 = page.getResources().getXObject(xObjectName);
        if (fm0 != null) {
            final COSObject pieceInfo = fm0.getCOSObject()
                    .getCOSObject(COSName.getPDFName("PieceInfo"));
            if (pieceInfo != null) {
                final COSBase adbeCompoundType = pieceInfo.getDictionaryObject(
                        COSName.getPDFName("ADBE_CompoundType"));
                if (adbeCompoundType != null) {
                    final COSBase privateKey = ((COSDictionary) adbeCompoundType)
                            .getDictionaryObject("Private");
                    if ("Watermark".equals(((COSName) privateKey).getName())) {
                        final PDResources resources = page.getResources();
                        resources.getCOSObject().removeItem(xObjectName);
                        page.getResources().getCOSObject().setNeedToBeUpdated(true);
                        return;
                    }
                }
            }
        }
    }
    super.write(contentStreamWriter, operator, operands);
}
And helper method:
private boolean isWatermark(final Operator operator,
final List < COSBase > operands) {
final String operatorString = operator.getName();
return operatorString.equals("Do") &&
operands.size() == 1 && ((COSName) operands.get(0)).getName().equals("Fm0");
}
The code seems to work fine: no watermark is shown on any page. However, I cannot get rid of the object containing the watermark. I tried to remove it with the following lines of code, but unfortunately the object is not removed.
final PDResources resources = page.getResources();
resources.getCOSObject().removeItem(xObjectName);
page.getResources().getCOSObject().setNeedToBeUpdated(true);
Here's a screenshot from pdfdebugger with watermark object:
And here's the watermark text. I couldn't find out how to check whether a watermark object contains this text and I'd like to know how to do this.
And here's one page of the pdf file: link1 and link2
You try to remove the XObject Fm0 from the resources like this:
final PDResources resources = page.getResources();
resources.getCOSObject().removeItem(xObjectName);
I.e. you fetch the COS (dictionary) object of the resources and try to remove the Fm0 (in xObjectName) entry.
If you look closely at your screenshot, though, you'll see that the Fm0 entry is not in the Resources dictionary directly. Instead, the Resources dictionary contains a nested XObject dictionary, which in turn holds the Fm0 entry.
Thus, the following should work:
final PDResources resources = page.getResources();
COSDictionary dict = (COSDictionary) (resources.getCOSObject().getDictionaryObject(COSName.XOBJECT));
dict.removeItem(xObjectName);
PDResources has some helper methods, so the following should also work:
page.getResources().put(xObjectName, (PDXObject)null);
You mention that the book belongs to you and you, therefore, are entitled to remove the watermark. That is not automatically the case. Depending on the laws (global and local) and the contracts applicable you may only have acquired the right to use the book in its current form, including the watermark. Please make sure you understand the restrictions under which you may use the book.
Also I wonder why you want to get rid of that XObject if the watermark does not show anymore and you merely wanted to change the file to print without the watermark...
Although mkl has answered this question, I'd like to share a solution using the iText library, even though I prefer PDFBox over iText because the former is provided free of charge. The iText code is less verbose than the PDFBox code. This is because once the watermark object is removed, it is automatically no longer shown on any page.
for (int i = 1; i <= document.getNumberOfPages(); i++) {
final PdfPage page = document.getPage(i);
final PdfDictionary xObject = page.getResources().getResource(PdfName.XObject);
if (xObject != null) {
final PdfStream fm0 = xObject.getAsStream(new PdfName("Fm0"));
if (fm0 != null) {
final PdfDictionary pieceInfo = fm0.getAsDictionary(new PdfName("PieceInfo"));
if (pieceInfo != null) {
final PdfDictionary adbeCompoundType = pieceInfo.getAsDictionary(
new PdfName("ADBE_CompoundType"));
if (adbeCompoundType != null) {
final PdfName privateKey = adbeCompoundType.getAsName(PdfName.Private);
if (privateKey != null) {
if ("Watermark".equals(privateKey.getValue())) {
xObject.remove(new PdfName("Fm0"));
}
}
}
}
}
}
}

Remove links from a PDF using iText 7.1

We have a vendor that will not accept PDFs that contain links. We are trying to remove the links by removing all link annotations from each page of the PDF using iText 7.1 (Java). We have tried multiple techniques based on research. Here are three examples of attempts to detect and remove the links. None of these result in the destination PDF (test-no-links.pdf) having the links removed. Any insight would be greatly appreciated.
Example 1: Remove based on class type of annotation
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
List<PdfAnnotation> annots = pdfPage.getAnnotations();
if ((annots == null) || (annots.size() == 0)) {
System.out.println("no annotations on page " + page);
}
else {
for( PdfAnnotation annot : annots ) {
if( annot instanceof PdfLinkAnnotation ) {
pdfPage.removeAnnotation(annot);
}
}
}
}
pdfDoc.close();
Example 2: Remove based on annotation subtype value
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
List<PdfAnnotation> annots = pdfPage.getAnnotations();
if ((annots == null) || (annots.size() == 0)) {
System.out.println("no annotations on page " + page);
}
else {
for( PdfAnnotation annot : annots ) {
// if this annotation has a link, delete it
if ( annot.getSubtype().equals(PdfName.Link) ) {
PdfDictionary annotAction = ((PdfLinkAnnotation)annot).getAction();
if( annotAction.get(PdfName.S).equals(PdfName.URI) ||
annotAction.get(PdfName.S).equals(PdfName.GoToR) ) {
PdfString uri = annotAction.getAsString(PdfName.URI);
System.out.println("Removing " + uri.toString());
pdfPage.removeAnnotation(annot);
}
}
}
}
}
pdfDoc.close();
Example 3: Remove all annotations (ignore annotation type)
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
// remove all annotations from the page regardless of type
pdfPage.getPdfObject().remove(PdfName.Annots);
}
pdfDoc.close();
Each of your tests generates a PDF without Link annotations.
Probably, though, your PDF viewer recognizes "www.qualpay.com" as (partial) URL and displays it as a link.
In detail
Your routines
All your tests successfully remove all Link annotations from your sample PDF, cf. these screen shots for the source and all three result files, in particular look for the page 1 Annots entry:
test-with-links.pdf
test-no-links.pdf
test-no-links-1.pdf
test-no-links-2.pdf
The viewer
Indeed, though, when viewing the PDF in Adobe Acrobat Reader (and also some other viewers, e.g. the built-in PDF viewers of Chrome and Edge), you'll see that "www.qualpay.com" is treated like a link.
The cause is that this is a feature of the PDF viewer! It scans the text of the PDF it displays for strings it recognizes as (a part of) some URL and displays them like links!
In Adobe Acrobat Reader you can disable this feature:
If you disable "Create links from URLs", you'll suddenly find the URLs in your result files inactive while the URL in your source file (with the link annotation) is still active.
What to do
"We have a vendor that will not accept PDFs that contain links."
First discuss with your vendor what exactly he means by "PDFs that contain links". Does he mean
PDFs with Link annotations or
PDFs with URLs that common PDF viewers present like Link annotations.
In the former case you're done, your code (either variant) removes the link annotations. You may have to demonstrate to the vendor how to disable the URL recognition in Adobe Acrobat Reader, though.
In the latter case you'll have to remove everything from the text content of your PDFs that common PDF viewers recognize as URLs. You may replace each URL by a bitmap image of the URL text, or the URL text drawn like a generic vector graphic (defining a path of lines and curves and filling that), or some similar surrogate.
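To get a first impression of which text a viewer might turn into links, one could scan the extracted page text for URL-like strings. The sketch below uses only the Java standard library and assumes the page text has already been extracted into a String (the extraction step is not shown). The class name UrlLikeText and its pattern are a rough heuristic chosen for illustration; this is not the rule any particular viewer applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlLikeText {
    // Heuristic (an assumption): anything starting with http://, https:// or www.
    private static final Pattern URL_LIKE =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)[^\\s<>\"]+");

    // Returns every URL-like substring, with trailing punctuation stripped.
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = URL_LIKE.matcher(text);
        while (m.find()) {
            hits.add(m.group().replaceAll("[.,;:!?)]+$", ""));
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(find("Contact us at www.qualpay.com."));
    }
}
```

Any hit is a candidate for replacement by one of the surrogates mentioned above. A viewer may still recognize patterns this heuristic misses, so the result is a lower bound at best.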

Open and Save to Same PDF File Name [duplicate]

I need to replace a word in an existing PDF AcroField with another word. I am using PdfStamper from iTextSharp to do this and it works fine. However, doing so requires creating a new PDF, and I would like the change to be reflected in the existing PDF itself. If I set the destination filename to the same value as the original filename, no change is reflected. I am new to iTextSharp; is there anything I am doing wrong? Please help. Here is the piece of code I am using:
private void ListFieldNames(string s)
{
try
{
string pdfTemplate = @"z:\TEMP\PDF\PassportApplicationForm_Main_English_V1.0.pdf";
string newFile = @"z:\TEMP\PDF\PassportApplicationForm_Main_English_V1.0.pdf";
PdfReader pdfReader = new PdfReader(pdfTemplate);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
PdfReader reader = new PdfReader((string)pdfTemplate);
using (PdfStamper stamper = new PdfStamper(reader, new FileStream(newFile, FileMode.Create, FileAccess.ReadWrite)))
{
AcroFields form = stamper.AcroFields;
var fieldKeys = form.Fields.Keys;
foreach (string fieldKey in fieldKeys)
{
//Replace Address Form field with my custom data
if (fieldKey.Contains("Address"))
{
form.SetField(fieldKey, s);
}
}
stamper.FormFlattening = true;
stamper.Close();
}
}
}
As documented in my book iText in Action, you can't read a file and write to it simultaneously. Think of how Word works: you can't open a Word document and write directly to it. Word always creates a temporary file, writes the changes to it, then replaces the original file with it and then throws away the temporary file.
You can do that too:
read the original file with PdfReader,
create a temporary file for PdfStamper, and when you're done,
replace the original file with the temporary file.
Or:
read the original file into a byte[],
create PdfReader with this byte[], and
use the path to the original file for PdfStamper.
This second option is more dangerous, as you'll lose the original file if you do something that causes an exception in PdfStamper.
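The first option (write to a temporary file, then replace the original) can be sketched in Java with the standard library alone. The question uses C# and iTextSharp, so this only illustrates the pattern. The class name SafeRewrite and the Transformer interface are hypothetical; in real use the transformer body would be where the stamper writes its output.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SafeRewrite {
    // Hypothetical callback: receives the original bytes, writes the new file.
    interface Transformer {
        void apply(byte[] original, OutputStream out) throws IOException;
    }

    static void rewriteInPlace(Path original, Transformer transformer) throws IOException {
        byte[] bytes = Files.readAllBytes(original);
        // Create the temporary file next to the original so the final move
        // stays on the same file system.
        Path dir = original.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "rewrite-", ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(temp)) {
                transformer.apply(bytes, out);
            }
            // Only replace the original after the new content was written completely.
            Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // After a successful move this is a no-op; after a failure it cleans up.
            Files.deleteIfExists(temp);
        }
    }
}
```

If the transformer throws, the original file is left untouched and only the temporary file is discarded, which is the safety property the second option lacks.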

iText attaching file to existing PDF/A-3 results in PdfAConformanceException

I am trying to attach a file to an existing PDF/A-3.
This example explains how to create a PDF/A-3 and attach content to it.
My next step was to adapt the code and use the PdfAStamper instead of the document.
So here is my resulting code.
private ByteArrayOutputStream append(byte[] content, InputStream inPdf) throws IOException, DocumentException {
ByteArrayOutputStream result = new ByteArrayOutputStream(16000);
PdfReader reader = new PdfReader(inPdf);
PdfAStamper stamper = new PdfAStamper(reader, result, PdfAConformanceLevel.PDF_A_3B);
stamper.createXmpMetadata();
// Creating PDF/A-3 compliant attachment.
PdfDictionary embeddedFileParams = new PdfDictionary();
embeddedFileParams.put(PARAMS, new PdfName(ZF_NAME));
embeddedFileParams.put(MODDATE, new PdfDate());
PdfFileSpecification fs = PdfFileSpecification.fileEmbedded(stamper.getWriter(), null,ZF_NAME, content , "text/xml", embeddedFileParams,0);
fs.put(AFRELATIONSHIP, Alternative);
stamper.addFileAttachment("file description",fs);
stamper.close();
reader.close();
return result;
}
Here is the Stacktrace of the error.
com.itextpdf.text.pdf.PdfAConformanceException: EF key of the file specification dictionary for an embedded file shall contain dictionary with valid F key.
at com.itextpdf.text.pdf.internal.PdfA3Checker.checkFileSpec(PdfA3Checker.java:95)
at com.itextpdf.text.pdf.internal.PdfAChecker.checkPdfAConformance(PdfAChecker.java:198)
at com.itextpdf.text.pdf.internal.PdfAConformanceImp.checkPdfIsoConformance(PdfAConformanceImp.java:70)
at com.itextpdf.text.pdf.PdfWriter.checkPdfIsoConformance(PdfWriter.java:3380)
at com.itextpdf.text.pdf.PdfWriter.checkPdfIsoConformance(PdfWriter.java:3376)
at com.itextpdf.text.pdf.PdfFileSpecification.toPdf(PdfFileSpecification.java:309)
at com.itextpdf.text.pdf.PdfIndirectObject.writeTo(PdfIndirectObject.java:157)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.write(PdfWriter.java:424)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:402)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:381)
at com.itextpdf.text.pdf.PdfWriter$PdfBody.add(PdfWriter.java:334)
at com.itextpdf.text.pdf.PdfWriter.addToBody(PdfWriter.java:819)
at com.itextpdf.text.pdf.PdfFileSpecification.getReference(PdfFileSpecification.java:256)
at com.itextpdf.text.pdf.PdfDocument.addFileAttachment(PdfDocument.java:2253)
at com.itextpdf.text.pdf.PdfWriter.addFileAttachment(PdfWriter.java:1714)
at com.itextpdf.text.pdf.PdfStamper.addFileAttachment(PdfStamper.java:497)
Now when I analyze the stack trace and look at PdfFileSpecification.fileEmbedded, I see that an EF is created with an F and a UF entry.
Looking inside PdfA3Checker, I see that in the line PdfDictionary embeddedFile = getDirectDictionary(dict.get(PdfName.F)); the F entry is not a dictionary but a stream.
if (fileSpec.contains(PdfName.EF)) {
PdfDictionary dict = getDirectDictionary(fileSpec.get(PdfName.EF));
if (dict == null || !dict.contains(PdfName.F)) {
throw new PdfAConformanceException(obj1, MessageLocalization.getComposedMessage("ef.key.of.file.specification.dictionary.shall.contain.dictionary.with.valid.f.key"));
}
PdfDictionary embeddedFile = getDirectDictionary(dict.get(PdfName.F));
if (embeddedFile == null) {
throw new PdfAConformanceException(obj1, MessageLocalization.getComposedMessage("ef.key.of.file.specification.dictionary.shall.contain.dictionary.with.valid.f.key"));
}
checkEmbeddedFile(embeddedFile);
}
Is this a bug in iText or am I missing something? By the way I am using iText 5.4.5.
Update 1
As suggested by Bruno and mkl, the 4.5.6-Snapshot should contain the fix. I ran my test case (Gist link to full test case) against the current trunk, but the result was the same error.
You ran into a bug very similar to the one discussed in "Creating PDF/A-3: Embedded file shall contain valid Params key":
The problem (as you found out) is in this code
PdfDictionary embeddedFile = getDirectDictionary(dict.get(PdfName.F));
if (embeddedFile == null) {
throw new PdfAConformanceException(obj1, MessageLocalization.getComposedMessage("ef.key.of.file.specification.dictionary.shall.contain.dictionary.with.valid.f.key"));
}
in PdfA3Checker.checkFileSpec(PdfWriter, int, Object); even though dict contains a stream named F, getDirectDictionary(dict.get(PdfName.F)) does not return it. The reason is not, though, that a dictionary is sought here (a stream essentially is a dictionary with some additions), but it is an issue in PdfAChecker.getDirectObject which is called by PdfAChecker.getDirectDictionary:
protected PdfObject getDirectObject(PdfObject obj) {
if (obj == null)
return null;
//use counter to prevent indirect reference cycling
int count = 0;
while (obj.type() == 0) {
PdfObject tmp = cachedObjects.get(new RefKey((PdfIndirectReference)obj));
if (tmp == null)
break;
obj = tmp;
//10 - is max allowed reference chain
if (count++ > 10)
break;
}
return obj;
}
This method only looks for cached objects (i.e. in cachedObjects), but in your case (and in a test of mine, too) this stream has already been written to file and is no longer in the cache, resulting in null being returned... Correction, cf. the PPS: it has been written, but it was never cached to begin with.
PS: PDF/A-3 conform file attachments work if added during PDF creation (using a PdfAWriter), but not if added during PDF manipulation (using a PdfAStamper); maybe the caching is different in these use cases.
PPS: Indeed there is a difference: PdfAWriter overrides the addToBody overloads by adding the added objects to a cache. PdfAStamperImp does not do so and furthermore is derived from PdfStamperImp and PdfWriter, not from PdfAWriter.
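The effect of the missing cache entry can be reproduced with a toy model that uses no iText classes. Everything here is a hypothetical stand-in: DirectObjectModel, Ref and the use of a Map as a "dictionary" are illustrations only, and the shape of getDirectDictionary is an assumption (only getDirectObject is quoted above).

```java
import java.util.HashMap;
import java.util.Map;

class DirectObjectModel {
    // Stand-in for an indirect reference to a numbered object.
    static final class Ref {
        final int number;
        Ref(int number) { this.number = number; }
    }

    private final Map<Integer, Object> cachedObjects = new HashMap<>();

    // Corresponds to what PdfAWriter is described as doing in addToBody.
    void cache(int number, Object value) {
        cachedObjects.put(number, value);
    }

    // Mirrors the quoted loop: follow references only through the cache.
    Object getDirectObject(Object obj) {
        int count = 0;
        while (obj instanceof Ref) {
            Object tmp = cachedObjects.get(((Ref) obj).number);
            if (tmp == null) break;   // cache miss: obj stays a reference
            obj = tmp;
            if (count++ > 10) break;  // guard against reference cycles
        }
        return obj;
    }

    // Assumed shape: anything that is not a dictionary after resolving yields null.
    Map<?, ?> getDirectDictionary(Object obj) {
        Object direct = getDirectObject(obj);
        return (direct instanceof Map) ? (Map<?, ?>) direct : null;
    }
}
```

With the object cached, getDirectDictionary(new Ref(5)) returns the map. Without the cache call, the unresolved reference is not a dictionary, so the result is null and the conformance check fails.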
You've indeed hit a bug in iText 5.4.5. This bug was reported here: Creating PDF/A-3: Embedded file shall contain valid Params key
It was fixed in the SVN version of iText. We're preparing the next version of iText (due this week).

How to use Apache HWPF to extract text and images out of a DOC file

I downloaded Apache HWPF. I want to use it to read a DOC file and write its text into a plain text file. I don't know HWPF very well.
My very simple program is here:
I have 3 problems now:
Some of the packages have errors (they can't find apache hdf). How can I fix them?
How can I use the methods of HWPF to find and extract the images?
Some parts of my program are incomplete and incorrect, so please help me complete it.
I have to complete this program in 2 days.
Once again, please help me complete this.
Thanks a lot for your help!
This is my elementary code :
public class test {
public void m1 (){
String filesname = "Hello.doc";
POIFSFileSystem fs = null;
fs = new POIFSFileSystem(new FileInputStream(filesname));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
String str = we.getText() ;
String[] paragraphs = we.getParagraphText();
Picture pic = new Picture(. . .) ;
pic.writeImageContent( . . . ) ;
PicturesTable picTable = new PicturesTable( . . . ) ;
if ( picTable.hasPicture( . . . ) ){
picTable.extractPicture(..., ...);
picTable.getAllPictures() ;
}
}
Apache Tika will do this for you. It handles talking to POI to do the HWPF stuff, and presents you with either XHTML or Plain Text for the contents of the file. If you register a recursing parser, then you'll also get all the embedded images too.
// You can use org.apache.poi.hwpf.extractor.WordExtractor to get the text
String fileName = "example.doc";
HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
WordExtractor extractor = new WordExtractor(wordDoc);
String[] text = extractor.getParagraphText();
int lineCounter = text.length;
String articleStr = ""; // This string stores the text from the Word document.
for (int index = 0; index < lineCounter; ++index) {
    String paragraphStr = text[index].replaceAll("\r\n", "").replaceAll("\n", "").trim();
    int paragraphLength = paragraphStr.length();
    if (paragraphLength != 0) {
        // Strings are immutable: concat returns a new string, so assign the result
        articleStr = articleStr.concat(paragraphStr);
    }
}
// You can use org.apache.poi.hwpf.usermodel.Picture to get the images
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
for (int i = 0; i < picturesList.size(); ++i) {
    BufferedImage image = null;
    Picture pic = picturesList.get(i);
    image = ImageIO.read(new ByteArrayInputStream(pic.getContent()));
    if (image != null) {
        System.out.println("Image[" + i + "]" + " ImageWidth:" + image.getWidth() + " ImageHeight:" + image.getHeight() + " Suggest Image Format:" + pic.suggestFileExtension());
    }
}
If you just want to do this, and you don't care about the coding, you can just use Antiword.
$ antiword file.doc > out.txt
I know this is long after the fact, but I've found TextMining on Google Code to be more accurate and very easy to use. It is, however, pretty much abandoned code.
