Reading a particular page from a PDF document using PDFBox - java

How do I read a particular page (given a page number) from a PDF document using PDFBox?

This should work (PDFBox 1.x, where getAllPages is a method of the document catalog):
PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
as seen in the BookMark section of the tutorial.
Update 2015, version 2.0.0 SNAPSHOT
It seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:
PDDocument document = PDDocument.load(new File(filename));
PDPage page = document.getPage(0);
The getAllPages method has been renamed getPages, and the cast is no longer needed. Equivalent to the line above:
PDPage page = document.getPages().get(0);

// Using the PDFBox library available from http://pdfbox.apache.org/
// Writes specific pages of a pdf document as a new pdf file
// Read in the source pdf document
PDDocument pdDoc = PDDocument.load(file);
// Create a new pdf document
PDDocument document = null;
// Add page "i" (a zero-based index into the page list), then save the new pdf document
try {
    document = new PDDocument();
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
    document.save("file path" + "new document title" + ".pdf");
} catch (Exception e) {
    e.printStackTrace(); // report the failure instead of swallowing it
} finally {
    if (document != null) document.close();
    pdDoc.close(); // the source must stay open until the copy has been saved
}

Thought I would add my answer here, as I found the above answers useful but not exactly what I needed.
In my scenario I wanted to scan each page individually and look for a keyword; if that keyword appeared, I would then do something with that page (i.e. copy or ignore it).
I've tried to simplify and replace common variables etc. in my answer:
public void extractImages() throws Exception {
    PDDocument document = null;
    PDDocument newDocument = null;
    try {
        String destinationDir = "OUTPUT DIR GOES HERE";
        // Load the pdf
        String inputPdf = "INPUT PDF DIR GOES HERE";
        document = PDDocument.load(inputPdf);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        // Declare output fileName
        String fileName = "output.pdf";
        // Create output file
        newDocument = new PDDocument();
        // Create PDFTextStripper - used for searching the page string
        PDFTextStripper textStripper = new PDFTextStripper();
        // Loop through each page and search for "SEARCH STRING". If it doesn't exist
        // (i.e. this is the image page), copy the page into the new output.pdf.
        // PDFTextStripper page numbers are 1-based; the page list is 0-based.
        for (int pageNumber = 1; pageNumber <= list.size(); pageNumber++) {
            // Set textStripper to search one page at a time
            textStripper.setStartPage(pageNumber);
            textStripper.setEndPage(pageNumber);
            // Fetch the text of this page
            String pageText = textStripper.getText(document);
            boolean found = pageText.contains("SEARCH STRING");
            // If nothing is found, copy the page across to the new output pdf file
            if (!found) {
                PDPage returnPage = list.get(pageNumber - 1);
                System.out.println("page returned is: " + returnPage);
                System.out.println("Copy page");
                newDocument.importPage(returnPage);
            }
        }
        newDocument.save(destinationDir + fileName);
        System.out.println(fileName + " saved");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("catch extract image");
    } finally {
        // Close both documents; the source must stay open until the copy is saved
        if (newDocument != null) newDocument.close();
        if (document != null) document.close();
    }
}

You can use the getPage method on a PDDocument instance:
PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);

Here is the solution. Hope it will solve your issue.
String fileName = "C:\\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
// Pages 1 to 2 will be parsed. To parse only one page, set both values the same (e.g. setStartPage(1); setEndPage(1);)
String result = stripper.getText(doc);
doc.close();
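Note that the two APIs used in this thread count pages differently: setStartPage and setEndPage take 1-based page numbers, while getPage(0) and the list returned by getAllPages() are zero-based. A minimal sketch of a conversion helper follows, using only the JDK. The class and method names are made up for illustration and are not part of PDFBox, and the page count of 5 is an assumed value.

```java
/** Hypothetical helper, not part of PDFBox: converts between the two page-numbering schemes. */
final class PageNumbers {

    private PageNumbers() {
    }

    /** Converts a 1-based page number (PDFTextStripper style) to a 0-based index (getPage style). */
    public static int toIndex(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "Page number " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    /** Converts a 0-based index (getPage style) to a 1-based page number (PDFTextStripper style). */
    public static int toPageNumber(int index, int pageCount) {
        if (index < 0 || index >= pageCount) {
            throw new IllegalArgumentException(
                    "Page index " + index + " is outside 0.." + (pageCount - 1));
        }
        return index + 1;
    }

    public static void main(String[] args) {
        int pageCount = 5; // assumed; with PDFBox this would come from doc.getNumberOfPages()
        System.out.println("Page 1 has index " + toIndex(1, pageCount));
        System.out.println("Index 4 is page " + toPageNumber(4, pageCount));
    }
}
```

Validating against the page count up front gives a clear error message for an out-of-range page, rather than a failure somewhere inside the library.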

Add this to the command-line call:
ExtractText -startPage 1 -endPage 1 filename.pdf
Change 1 to the page number that you need.

Related

No glyph found after getting Text and Font from existing pdf

My goal is to transfer textual content from a PDF to a new PDF while preserving the font formatting (e.g. bold, italic, underlined).
I am trying to use the TextPosition list from the existing PDF and write a new PDF from it.
To do this, I take the font and font size of the current TextPosition entry and set them on a contentStream, then write the text through contentStream.showText().
After 137 successful loop iterations, this error follows:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
    TextPosition tpInfo = null;
    String pdfFileInText = null;
    int charIDindex = 0;
    int pageIndex = 0;
    try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
        if (!pdfDocument.isEncrypted()) {
            MyPdfTextStripper myStripper = new MyPdfTextStripper();
            var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
            createDirectory();
            String newFileString = (srcErledigt + "Test.pdf");
            File input = new File(newFileString);
            input.createNewFile();
            PDDocument document = new PDDocument();
            // For Pages
            for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
                List<List<TextPosition>> pageList = pageIterator.next();
                PDPage newPage = new PDPage();
                document.addPage(newPage);
                PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
                contentStream.beginText();
                pageIndex++;
                // For Articles
                for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
                    List<TextPosition> articleList = articleIterator.next();
                    // For Text
                    for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
                        int tpCharID = charIDindex;
                        tpInfo = tpIterator.next();
                        System.out.println(tpCharID + ". charID: " + tpInfo);
                        PDFont tpFont = tpInfo.getFont();
                        float tpFontSize = tpInfo.getFontSize();
                        pdfFileInText = tpInfo.toString();
                        contentStream.setFont(tpFont, tpFontSize);
                        contentStream.newLineAtOffset(50, 700);
                        contentStream.showText(pdfFileInText);
                        charIDindex++;
                    }
                }
                contentStream.endText();
                contentStream.close();
            }
        } else {
            System.out.println("pdf Encrypted");
        }
    }
}
MyPdfTextStripper:
public class MyPdfTextStripper extends PDFTextStripper {
    public MyPdfTextStripper() throws IOException {
        super();
        setSortByPosition(true);
    }

    @Override
    public List<List<TextPosition>> getCharactersByArticle() {
        return super.getCharactersByArticle();
    }

    // Add Pages to CharactersByArticle List
    public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
        final int maxPageNr = doc.getNumberOfPages();
        List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
        for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
            setStartPage(pageNr);
            setEndPage(pageNr);
            getText(doc);
            byPageList.add(List.copyOf(getCharactersByArticle()));
        }
        return byPageList;
    }
}
Additional Info:
There are seven fonts in my document, all of which are set as subsets.
I need to write the Text given with the corresponding Font given.
All glyphs that should be written already exist in the original document, where I get my TextPositionList from.
All fonts are subtype 1 or 0
There is no AcroForm defined
Thanks in advance
Edit 30.08.2022:
Fixed the Issue by manually replacing this particular Unicode with a placeholder for the String before trying to write it.
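The replacement step described in this edit can be sketched as follows. This is a minimal JDK-only illustration, not the author's actual code: the class and method names are made up, and the choice of a plain hyphen-minus as the placeholder is an assumption, since the post does not say which placeholder was used. U+00AD is the soft hyphen named in the exception message.

```java
/** Hypothetical helper: swaps the soft hyphen for a character the subset font is more likely to contain. */
final class GlyphSanitizer {

    /** The code point from the exception message: U+00AD (soft hyphen). */
    private static final char SOFT_HYPHEN = '\u00AD';

    /** Assumed placeholder; any character known to exist in the target font would do. */
    private static final char PLACEHOLDER = '-';

    private GlyphSanitizer() {
    }

    /** Returns the text with every soft hyphen replaced by the placeholder. */
    public static String replaceSoftHyphen(String text) {
        return text.replace(SOFT_HYPHEN, PLACEHOLDER);
    }

    public static void main(String[] args) {
        String raw = "Zu" + SOFT_HYPHEN + "sammen";
        // The sanitized string is what would be passed to contentStream.showText(...)
        System.out.println(replaceSoftHyphen(raw));
    }
}
```

In the loop from the question, the call would sit between reading the text of the TextPosition and passing it to showText.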
Now I ran into this open ToDo:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
@Override
public byte[] encode(int unicode)
{
    // todo: we can use a known character collection CMap for a CIDFont
    // and an Encoding for Type 1-equivalent
    throw new UnsupportedOperationException();
}
Anyone got any suggestions or Workarounds for this?
Edit 01.09.2022
I tried to replace occurrences of that font with an alternative font from the source file, but this causes another problem: a COSStream is "randomly" closed, so the new document cannot be saved after I write my text with a contentStream.
Using standard fonts like PDType1Font.HELVETICA instead works, though.

PdfBoxRenderer with custom width and height

I have HTML content stored as a raw string in my database and I would like to print it to PDF with a custom page size, for example 10 cm wide and 7 cm high, rather than the standard A4 format.
Can someone give me an example, if this is possible?
ByteArrayOutputStream out = new ByteArrayOutputStream();
PDRectangle rec = new PDRectangle(recWidth, recHeight);
PDPage page = new PDPage(rec);
try (PDDocument document = new PDDocument()) {
    PdfRendererBuilder builder = new PdfRendererBuilder();
    builder.defaultTextDirection(BaseRendererBuilder.TextDirection.LTR);
    String htmlContent = "<b>Hello world</b>" + content;
    builder.withHtmlContent(htmlContent, "");
    document.addPage(page);
    builder.usePDDocument(document);
    PdfBoxRenderer renderer = builder.buildPdfRenderer();
    renderer.createPDFWithoutClosing();
    document.save(out);
} catch (Exception e) {
    e.printStackTrace();
}
return new ByteArrayInputStream(out.toByteArray());
This code generates for me 2 files, one small and one A4.
UPDATE:
I tried this one:
try (PDDocument document = new PDDocument()) {
    PdfRendererBuilder builder = new PdfRendererBuilder();
    builder.defaultTextDirection(BaseRendererBuilder.TextDirection.LTR);
    builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
    builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
    String htmlContent = "<b>content</b>";
    builder.withHtmlContent(htmlContent, "");
    builder.usePDDocument(document);
    PdfBoxRenderer renderer = builder.buildPdfRenderer();
    renderer.createPDFWithoutClosing();
    document.save(out);
} catch (Exception e) {
    log.error(">>> The creation of PDF is invalid!");
}
But in this case the content is not shown. If I remove useDefaultPageSize, the content is shown.

I haven't tested this solution, but try initialising the builder object with your desired page size and document type, like below:
builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
The library supports several PDF/A conformance levels; see the PdfAConformance enum for the possible values.
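For the 10 cm by 7 cm page in the question, the units matter. The 210 and 297 passed to useDefaultPageSize above are A4 in millimetres, so the same call would take 100 and 70. By contrast, new PDRectangle(width, height) expects PDF points, where one point is 1/72 inch. A minimal JDK-only sketch of the conversion follows; PageUnits and mmToPoints are made-up names for illustration and not part of either library.

```java
/** Hypothetical helper: converts millimetres to PDF points (1 point = 1/72 inch). */
final class PageUnits {

    private static final float POINTS_PER_INCH = 72f;
    private static final float MM_PER_INCH = 25.4f;

    private PageUnits() {
    }

    /** Converts a length in millimetres to PDF points. */
    public static float mmToPoints(float mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // 10 cm x 7 cm, the page size asked for in the question
        float widthPt = mmToPoints(100f);
        float heightPt = mmToPoints(70f);
        // These two values are what would be passed to new PDRectangle(widthPt, heightPt)
        System.out.println("width in points: " + widthPt);
        System.out.println("height in points: " + heightPt);
    }
}
```

If recWidth and recHeight in the question hold centimetres or millimetres rather than points, the resulting page would be far smaller than intended.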

What is the correct way to deep clone PDPage?

I am working with PDFBox v2 and trying to clone the first PDPage of a PDDocument to keep it as a template for new PDPages. That first page has some AcroForm fields that I need to fill.
I tried some methods, but none of them does what I want to achieve.
1) Copy the first page content and add it to the document when I need a new page. That copies the page, but the AcroForm fields are linked to the fields on the other pages: if I modify a field value on the first page, it shows up on the other pages.
//Save in variable first page content
COSDictionary pageContent = (COSDictionary)doc.getPage(0).getCOSObject();
...
//when i need insert new page
doc.addPage(new PDPage(pageContent));
2) Clone the first page content and then add it to the document as in the first method. That copies the page, but no fields are copied :/
PDFCloneUtility cloner = new PDFCloneUtility(doc);
COSDictionary pageContent = (COSDictionary)cloner.cloneForNewDocument(doc.getPage(0).getCOSObject());
...
//when i need insert new page
doc.addPage(new PDPage(pageContent));
So what is the correct way to make a deep copy of a PDPage whose AcroForm fields are independent of those on the first page?
Thanks!
I got the solution!
1) Start with an empty PDF template that has only 1 page. Open the template document, fill in the common data and save it as a byte[] in memory.
PDDocument templatedoc = PDDocument.load(new File(path));
PDDocumentCatalog catalog = templatedoc.getDocumentCatalog();
PDAcroForm acroForm = catalog.getAcroForm();
... fill acroForm common data of all pages ...
ByteArrayOutputStream basicTemplate = new ByteArrayOutputStream();
templatedoc.save(basicTemplate);
byte[] filledBasicTemplate = basicTemplate.toByteArray();
2) Generate a new document for each page needed.
List<PDDocument> documents = new ArrayList<PDDocument>();
PDDocument activeDoc;
for (int i = 0; i < 5; i++) {
    activeDoc = PDDocument.load(filledBasicTemplate);
    documents.add(activeDoc);
    ... fill acroform or whatever you need in each page ...
}
3) Import the first page of each new document into the final document and save the final document.
PDDocument finalDoc = new PDDocument();
for (PDDocument currentDoc : documents) {
    ... fill common things like page numbers ...
    finalDoc.importPage(currentDoc.getPage(0));
}
finalDoc.save(new File(path));
... close all documents after saving the final document ...
It may not be the most optimized code, but it works.

Reading text in pdf file and split into multiple pdf files

I have a consolidated PDF file in which each page carries an id and a page number of the form "Page X of Y". I need to split this one PDF into multiple PDF files based on the "Page X of Y" text. I am trying to do a POC using iText, but I am struggling to read "Page X of Y" to identify the page numbers I need for splitting the file. Could someone shed some light on implementing this in Java?
I tried the below code:
public static void main(String args[]) {
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File("C:\\basics\\outbound\\FPPStmts.pdf");
    try {
        // PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead
        RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
        PDFParser parser = new PDFParser(randomAccessFile);
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(2);
        String parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
This gives me blank text even though my PDF contains data.
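Once the text of a single page is available (for example by setting setStartPage and setEndPage to the same page number), finding the "Page X of Y" marker is a plain regular-expression task. A minimal JDK-only sketch follows. The class and method names are made up for illustration, and the exact marker format, including the tolerance for extra whitespace and letter case, is an assumption about the document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds a "Page X of Y" marker in the extracted text of one page. */
final class PageMarker {

    // Assumed marker format; digits are capped at 9 so Integer.parseInt cannot overflow
    private static final Pattern PAGE_X_OF_Y =
            Pattern.compile("Page\\s+(\\d{1,9})\\s+of\\s+(\\d{1,9})", Pattern.CASE_INSENSITIVE);

    private PageMarker() {
    }

    /** Returns {x, y} for the first marker found in the text, or null if there is none. */
    public static int[] parse(String pageText) {
        Matcher matcher = PAGE_X_OF_Y.matcher(pageText);
        if (!matcher.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(matcher.group(1)),
            Integer.parseInt(matcher.group(2))
        };
    }

    /** A page reading "Page 1 of Y" is treated as the first page of a new output file. */
    public static boolean startsNewDocument(String pageText) {
        int[] marker = parse(pageText);
        return marker != null && marker[0] == 1;
    }

    public static void main(String[] args) {
        String sample = "ID 4711\nStatement\nPage 2 of 5"; // made-up page text
        int[] marker = parse(sample);
        System.out.println("x=" + marker[0] + ", y=" + marker[1]);
        System.out.println("starts new document: " + startsNewDocument(sample));
    }
}
```

Walking the pages in order and starting a new output document whenever startsNewDocument returns true would give one file per "Page 1 of Y" run. This only works once text extraction actually returns the page text.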

Reading barcode from PDF using PDF Box

I have a PDF with barcodes, as shown in the image.
I am stuck on an issue and cannot proceed with my project. I am using PDFBox to parse the PDF and can convert the whole PDF to text, as shown in the code below:
public static PdfValues readPdf() throws IOException {
    System.out.println("Main Method Started");
    File file = new File("D:/po/temp/output.pdf");
    PDDocument document = PDDocument.load(file);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);
    text = text.trim();
    text = text.replaceAll(" +", " ");
    text = text.replaceAll("(?m)^[ \t]*\r?\n", "");
    // System.out.println(text);
    deleteIfExist();
    writeToFile(text);
    PdfValues infos = readData();
    document.close();
    System.out.println("Main Method Ended");
    return infos;
}
But I am not getting the barcode value, which means it is not stored as text. Can anyone help me parse these barcode values, either as an image or as the actual value? Thank you for reading this question.
