read text from a particular page using PDFBox [duplicate] - java

This question already has answers here:
Reading a particular page from a PDF document using PDFBox
(6 answers)
Closed 9 years ago.
I know how to read the text of an entire PDF file with PDFBox using PDFTextStripper.getText(PDDocument).
I also have a sample on how to get an object reference to a particular page using PDDocumentCatalog.getAllPages().get(i).
How do I get the text of just one page using PDFBox? I don't see any such method on the PDPage class.

You can set parameters on the PDFTextStripper to read particular pages:
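PDDocument doc; // the loaded document
int i; // page number (1-based: the first page is 1)
PDFTextStripper reader = new PDFTextStripper();
reader.setStartPage(i); // first page to extract
reader.setEndPage(i); // last page to extract (inclusive)
String pageText = reader.getText(doc);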
As far as I'm aware, PDPage is used more for representing a page on screen than for extracting text. As such, I wouldn't recommend using it to extract text.

Related

ITEXT and PDFBOX is not detecting all the form fields present in the pdf

I used the code below to count the form fields in a PDF with iText and PDFBox in Java. The attached PDF has 11 fields, but the fields on page 1 are not detected, and the size printed is 2 in both cases.
PdfDocument doc = new PdfDocument(new PdfReader(file));
PdfAcroForm form = PdfAcroForm.getAcroForm(doc, true);
System.out.println("form fields size from Itext:"+form.getFormFields().size());
PDDocument document = PDDocument.load(file);
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
List<PDField> fields = acroForm.getFields();
System.out.println("form fields size from PDFBOX:"+fields.size());
PDF FILE HERE IN THIS LINK
The form information in your PDF is inconsistent.
The global AcroForm form definition in your PDF contains only 2 fields, Text Field 6 and Text Field 7, which happen to be the two fields on page two.
Page one in its Annots array references ten form field widgets, each of them merged with a form field object. These fields are not referenced from the AcroForm form definition in your PDF. Thus, they are not part of the form of the PDF but merely some arbitrary annotations hanging around.
To fix the issue, simply reference the form fields of the widget annotations of page one from the AcroForm form definition.
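The mismatch described above can be modelled without any PDF library. The sketch below is only an illustration with plain collections: the class name, the method name and the field names on page one are hypothetical, and it does not use the iText or PDFBox API. It lists the field names that page widgets reference but the form definition does not, which are the fields that would have to be added to the AcroForm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-collections model of a PDF form; NOT the iText or PDFBox API.
class OrphanFieldFinder {

    // Returns the names referenced by page widgets but missing from the
    // form definition, in the order in which they are first encountered.
    static List<String> findOrphans(Set<String> acroFormFields,
                                    List<List<String>> widgetsPerPage) {
        Set<String> orphans = new LinkedHashSet<>();
        for (List<String> pageWidgets : widgetsPerPage) {
            for (String name : pageWidgets) {
                if (!acroFormFields.contains(name)) {
                    orphans.add(name);
                }
            }
        }
        return new ArrayList<>(orphans);
    }

    public static void main(String[] args) {
        // The two fields the form definition knows about (from the answer above).
        Set<String> acroForm = new HashSet<>(
                Arrays.asList("Text Field 6", "Text Field 7"));
        // Hypothetical widget names standing in for the page-one fields.
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("Hypothetical Field A", "Hypothetical Field B"),
                Arrays.asList("Text Field 6", "Text Field 7"));
        System.out.println(findOrphans(acroForm, pages));
    }
}
```

In a real document the same comparison would be made between the AcroForm field tree and the widgets found in each page's Annots array.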

pdfBox Find page no that contains signature field [duplicate]

This question already has answers here:
how to get field page in PDFBox API 2?
(1 answer)
how to know if a field is on a particular page?
(4 answers)
Closed 2 years ago.
If I have a 2-page PDF document with a signature field (signature1), how can I parse the document using PDFBox to find which page contains the signature field (either blank or signed)?
Or, put differently: how can I find the page number of signature1 in a multi-page PDF document?
I can successfully add a signature field to page 2:
page = doc.getPage(1);
widget = signatureField.getWidgets().get(0);
widget.setAppearance(appearanceDictionary);
widget.setRectangle(rect);
// set it to page 2 (index 1)
widget.setPage(page);
from code example:
https://www.programcreek.com/java-api-examples/?api=org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField
Assuming you have the widget and it is not null:
PDPage signaturePage = widget.getPage();
int pageIndex = document.getPages().indexOf(signaturePage);
Now you have the 0-based page number; add 1 to get the page number as shown to users.
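The /P entry that links a widget back to its page is optional in PDF, so widget.getPage() may return null for some documents. A common fallback is to walk the pages and look for the widget in each page's annotation list. The sketch below models that search with plain collections. The class, method and annotation names are hypothetical, and it is not PDFBox code.

```java
import java.util.Arrays;
import java.util.List;

// Plain-collections model of "which page holds this widget?"; NOT the PDFBox API.
class WidgetPageLookup {

    // Returns the 0-based index of the first page whose annotation list
    // contains the widget, or -1 if no page references it.
    static int findPageIndex(List<List<String>> annotationsPerPage, String widget) {
        for (int i = 0; i < annotationsPerPage.size(); i++) {
            if (annotationsPerPage.get(i).contains(widget)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two-page model: the signature widget sits on the second page.
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("link1"),
                Arrays.asList("signature1"));
        int index = findPageIndex(pages, "signature1");
        System.out.println("0-based index: " + index
                + ", page number: " + (index + 1));
    }
}
```

For the two-page model above, signature1 is found at index 1, that is, on page 2.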

How to write data to pdf file which contains html tags using itext lib in Java

I have a String that contains some HTML tags and comes from a database. I want to write it to a PDF file with the styling expressed by those HTML tags. I tried to use XMLWorkerHelper like this:
String html = "What is the equation of the line passing through the "
        + "point (2,-3) and making an angle of -45<sup>2</sup> with the positive "
        + "X-axis?";
XMLWorkerHelper.getInstance().parseXHtml(writer, document, new StringReader(html));
But it only reads the text inside the HTML tag (in this case only the 2) and simply ignores the rest of the string. I want the entire String rendered with its HTML formatting.
With HTMLWorker it works perfectly, but that class is deprecated, so please let me know how to achieve this.
I am using the iText 5 library.

PDFBox create oversized pages

When I open a PDF created by PDFBox with PAGE_SIZE_A4 in Adobe to print it, the print dialog says it will "Shrink oversized pages" to 96%. Even if I decrease the page size myself, it shows "Shrink oversized pages" at 100%.
I know this may be a duplicate of: How to set Page Scaling option in Apache PDfBox. But that question is already 2 years old.
Using: latest pdfbox 1.8.9
My example code:
PDPage page = new PDPage(PDPage.PAGE_SIZE_A4); // new PDPage(595.27563f, 841.8898);
document.addPage(page);
PDPageContentStream cs = new PDPageContentStream(document, page);
/* With or without content */
cs.close();
document.save(pdfFile);
document.close();
The workaround with images is not an option.
iText is not an option.
Thank you.
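For reference, the dimensions in the commented-out constructor above follow from the A4 size of 210 mm x 297 mm and the PDF default unit of 1/72 inch (1 inch = 25.4 mm). The sketch below only reproduces that arithmetic. The class and method names are made up for illustration, and it does not address the print-dialog scaling itself.

```java
// Converts millimetres to PDF default user-space units (1/72 inch).
class A4Points {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // A4 is 210 mm wide and 297 mm high.
        System.out.println("A4 width in points:  " + mmToPoints(210));
        System.out.println("A4 height in points: " + mmToPoints(297));
    }
}
```

The results, roughly 595.2756 and 841.8898, agree with the values quoted in the question, so the page size itself matches A4.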

Editing an existing PDF without using iText

I want to add an index page to an existing PDF file and add page numbers to its pages.
All the suggested solutions point towards creating a new PDF and merging the existing PDF file with the new one.
Is there any other way to do this?
Also, I don't want to use iText since it's not free for commercial use.
According to your comments on the original question, you think that in PDFBox
"for adding a new page & content, you need to create a new pdf add new content and then merge the existing pdf. I wanted to avoid the merging step. Renaming is not the issue"
You might want to try something like this:
PDDocument doc = PDDocument.load(new FileInputStream(new File("original.pdf")));
PDPage page = new PDPage();
// fill page
doc.addPage(page);
doc.save("original-plus-page.pdf");
EDIT: In a comment on the answer, the question arose of how to insert a new page at a specific index (page number). To do this, the doc.addPage(page) call has to be replaced.
Originally this PDDocument method is defined like this:
/**
 * This will add a page to the document. This is a convenience method, that
 * will add the page to the root of the hierarchy and set the parent of the
 * page to the root.
 *
 * @param page The page to add to the document.
 */
public void addPage( PDPage page )
{
    PDPageNode rootPages = getDocumentCatalog().getPages();
    rootPages.getKids().add( page );
    page.setParent( rootPages );
    rootPages.updateCount();
}
We need a similar function that, instead of appending the page to the kids, inserts it at a given index. A helper method like the following in our own code will do:
public static void addPage(PDDocument doc, int index, PDPage page)
{
    PDPageNode rootPages = doc.getDocumentCatalog().getPages();
    rootPages.getKids().add(index, page);
    page.setParent(rootPages);
    rootPages.updateCount();
}
If you now replace the line
doc.addPage(page);
in the code of the original answer with
addPage(doc, 0, page);
the empty page is inserted at the front of the document.
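The helper relies on the two-argument List.add(index, element), which shifts the existing elements to the right instead of overwriting one. A minimal stand-alone illustration follows, with plain strings in place of pages. The class and method names are placeholders, not PDFBox types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Demonstrates insert-at-index semantics on an ordinary list.
class InsertAtIndexDemo {

    // Returns a copy of kids with page inserted at index; later entries shift right.
    // Throws IndexOutOfBoundsException unless 0 <= index <= kids.size().
    static <T> List<T> insertAt(List<T> kids, int index, T page) {
        List<T> copy = new ArrayList<>(kids);
        copy.add(index, page);
        return copy;
    }

    public static void main(String[] args) {
        List<String> kids = Arrays.asList("page 1", "page 2");
        // Index 0 puts the new entry in front of all existing ones.
        System.out.println(insertAt(kids, 0, "index page"));
    }
}
```

An index equal to the current size appends at the end, which matches what the original addPage does.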
