How do I read a particular page (given a page number) from a PDF document using PDFBox?
This should work:
PDPage firstPage = (PDPage)doc.getAllPages().get( 0 );
as seen in the BookMark section of the tutorial
Update 2015, Version 2.0.0 SNAPSHOT
Seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:
PDDocument document = PDDocument.load(new File(filename));
PDPage doc = document.getPage(0);
The getAllPages method has been renamed getPages
PDPage page = (PDPage)doc.getPages().get( 0 );
//Using PDFBox library available from http://pdfbox.apache.org/
//Writes pdf document of specific pages as a new pdf file
//Reads in pdf document
PDDocument pdDoc = PDDocument.load(file);
//Creates a new pdf document
PDDocument document = null;
//Adds specific page "i" where "i" is the page number and then saves the new pdf document
try {
document = new PDDocument();
document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
document.save("file path"+"new document title"+".pdf");
document.close();
}catch(Exception e){}
Thought I would add my answer here as I found the above answers useful but not exactly what I needed.
In my scenario I wanted to scan each page individually, look for a keyword, if that keyword appeared, then do something with that page (ie copy or ignore it).
I've tried to simply and replace common variables etc in my answer:
public void extractImages() throws Exception {
try {
String destinationDir = "OUTPUT DIR GOES HERE";
// Load the pdf
String inputPdf = "INPUT PDF DIR GOES HERE";
document = PDDocument.load( inputPdf);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
// Declare output fileName
String fileName = "output.pdf";
// Create output file
PDDocument newDocument = new PDDocument();
// Create PDFTextStripper - used for searching the page string
PDFTextStripper textStripper=new PDFTextStripper();
// Declare "pages" and "found" variable
String pages= null;
boolean found = false;
// Loop through each page and search for "SEARCH STRING". If this doesn't exist
// ie is the image page, then copy into the new output.pdf.
for(int i = 0; i < list.size(); i++) {
// Set textStripper to search one page at a time
textStripper.setStartPage(i);
textStripper.setEndPage(i);
PDPage returnPage = null;
// Fetch page text and insert into "pages" string
pages = textStripper.getText(document);
found = pages.contains("SEARCH STRING");
if (i != 0) {
// if nothing is found, then copy the page across to new output pdf file
if (found == false) {
returnPage = list.get(i - 1);
System.out.println("page returned is: " + returnPage);
System.out.println("Copy page");
newDocument.importPage(returnPage);
}
}
}
newDocument.save(destinationDir + fileName);
System.out.println(fileName + " saved");
}
catch (Exception e) {
e.printStackTrace();
System.out.println("catch extract image");
}
}
you can you getPage method over PDDocument instance
PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);
Here is the solution. Hope it will solve your issue.
string fileName="C:\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
//above page number 1 to 2 will be parsed. for parsing only one page set both value same (ex:setStartPage(1); setEndPage(1);)
string reslut = stripper.getText(doc);
doc.close();
Add this to the command-line call:
ExtractText -startPage 1 -endPage 1 filename.pdf
Change 1 to the page number that you need.
Related
My goal is to transfer textual content from a PDF to a new PDF while preserving the formatting of the font. (e.g. Bold, Italic, underlined..).
I try to use the TextPosition List from the existing PDF and write a new PDF from it.
For this I get from the TextPosition List the Font and FontSize of the current entry and set them in a contentStream to write the upcoming text through contentStream.showText().
after 137 successful loops this error follows:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
TextPosition tpInfo = null;
String pdfFileInText = null;
int charIDindex = 0;
int pageIndex = 0;
try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
if (!pdfDocument.isEncrypted()) {
MyPdfTextStripper myStripper = new MyPdfTextStripper();
var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
createDirectory();
String newFileString = (srcErledigt + "Test.pdf");
File input = new File(newFileString);
input.createNewFile();
PDDocument document = new PDDocument();
// For Pages
for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
List<List<TextPosition>> pageList = pageIterator.next();
PDPage newPage = new PDPage();
document.addPage(newPage);
PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
contentStream.beginText();
pageIndex++;
// For Articles
for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
List<TextPosition> articleList = articleIterator.next();
// For Text
for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
tpCharID = charIDindex;
tpInfo = tpIterator.next();
System.out.println(tpCharID + ". charID: " + tpInfo);
PDFont tpFont = tpInfo.getFont();
float tpFontSize = tpInfo.getFontSize();
pdfFileInText = tpInfo.toString();
contentStream.setFont(tpFont, tpFontSize);
contentStream.newLineAtOffset(50, 700);
contentStream.showText(pdfFileInText);
charIDindex++;
}
}
contentStream.endText();
contentStream.close();
}
} else {
System.out.println("pdf Encrypted");
}
}
}
MyPdfTextStripper:
public class MyPdfTextStripper extends PDFTextStripper {
public MyPdfTextStripper() throws IOException {
super();
setSortByPosition(true);
}
#Override
public List<List<TextPosition>> getCharactersByArticle() {
return super.getCharactersByArticle();
}
// Add Pages to CharactersByArticle List
public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
final int maxPageNr = doc.getNumberOfPages();
List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
setStartPage(pageNr);
setEndPage(pageNr);
getText(doc);
byPageList.add(List.copyOf(getCharactersByArticle()));
}
return byPageList;
}
Additional Info:
There are seven fonts in my document, all of which are set as subsets.
I need to write the Text given with the corresponding Font given.
All glyphs that should be written already exist in the original document, where I get my TextPositionList from.
All fonts are subtype 1 or 0
There is no AcroForm defined
Thanks in advance
Edit 30.08.2022:
Fixed the Issue by manually replacing this particular Unicode with a placeholder for the String before trying to write it.
Now I ran into this open ToDo:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
#Override
public byte[] encode(int unicode)
{
// todo: we can use a known character collection CMap for a CIDFont
// and an Encoding for Type 1-equivalent
throw new UnsupportedOperationException();
}
Anyone got any suggestions or Workarounds for this?
Edit 01.09.2022
I tried to replace occurrences of that Font with an alternative Font from the source file, but this opens another problem where a COSStream is "randomly" closed, which results in the new document not being able to save the File after writing my text with a contentStream.
Using standard Fonts like PDType1Font.HELVETICA instead works though..
I am having html content store as a raw string in my database and I like to print it in pdf, but with custom size, for example page size to be 10cm width and 7 com height, not standard A4 format.
Can someone gives me some examples if it is possible.
ByteArrayOutputStream out = new ByteArrayOutputStream();
PDRectangle rec = new PDRectangle(recWidth, recHeight);
PDPage page = new PDPage(rec);
try (PDDocument document = new PDDocument()) {
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.defaultTextDirection(BaseRendererBuilder.TextDirection.LTR);
String htmlContent = "<b>Hello world</b>" + content;
builder.withHtmlContent(htmlContent, "");
document.addPage(page);
builder.usePDDocument(document);
PdfBoxRenderer renderer = builder.buildPdfRenderer();
renderer.createPDFWithoutClosing();
document.save(out);
} catch (Exception e) {
ex.printStackTrace();
}
return new ByteArrayInputStream(out.toByteArray());
This code generates for me 2 files, one small and one A4.
UPDATE:
I tried this one:
try (PDDocument document = new PDDocument()) {
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.defaultTextDirection(BaseRendererBuilder.TextDirection.LTR);
builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
String htmlContent = "<b>content</b>";
builder.withHtmlContent(htmlContent, "");
builder.usePDDocument(document);
PdfBoxRenderer renderer = builder.buildPdfRenderer();
renderer.createPDFWithoutClosing();
document.save(out);
} catch (Exception e) {
log.error(">>> The creation of PDF is invalid!");
}
But in this case content is not shown, if I remove useDefaultPageSize, content will be shown
I didn't check this solution before, but try initialise the builder object with your desired page size and document type like below
builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
the lib include many PDF format next is PdfAConformance Enum with possible values
PdfAConformance Enum
I am working with PDFBOX v2, I'm trying to clone the first PDPage of a PDDocument for keep it as template for new PDPages. That first page, has some acroform fields that I need fill.
I tried some methods but anyone makes I want to achieve.
1) Copy the first page content and add it to the document when I need a new page. That copy the page but the acroform field are linked with other pages fields, and if I modify field value from the first page, that shows in the other pages.
//Save in variable first page content
COSDictionary pageContent = (COSDictionary)doc.getPage(0).getCOSObject();
...
//when i need insert new page
doc.addPage(new PDPage(pageContent));
2) Clone the first page content and then add to the document like the first method. That copy the page but no field is copied :/
PDFCloneUtility cloner = new PDFCloneUtility(doc);
COSDictionary pageContent = (COSDictionary)cloner.cloneForNewDocument(doc.getPage(0).getCOSObject());
...
//when i need insert new page
doc.addPage(new PDPage(pageContent));
Then, what is the correct way to make a deep copy of a PDPage getting acroform fields independent from the first page?
Thanks!
I got the solution!
1) Start with an empty pdf template, only has 1 page. Open template document, fill common data and save as byte[] in memory.
PDDocument templatedoc = PDDocument.load(new File(path));
PDDocumentCatalog catalog = templatedoc.getDocumentCatalog();
PDAcroFrom acroForm = catalog.getAcroForm());
... fill acroForm common data of all pages ...
ByteArrayOutputStream basicTemplate = new ByteArrayOutputStream();
templatedoc.save(basicTemplate);
byte[] filledBasicTemplate = basicTemplate.toByteArray();
2) Generate new document for each needed page.
List<PDDocument> documents = new ArrayList<PDDocument>();
PDDocument activeDoc;
for(int i = 0; i < 5; i++) {
activeDoc = PDDocument.load(filledBasicTemplate);
documents.add(activeDoc);
... fill acroform or you need in each page ...
}
3) Import all new document first pages into final document and save final document.
PDDocument finalDoc = new PDDocument();
for(PDDocument currentDoc : documents) {
... fill common things like page numbers ...
finalDoc.importPage(currentDoc.getPage(0));
}
finalDoc.save(new File(path));
... close all documents after save the final document ...
It maybe not be the most optimized code, but it works.
I have a consolidated pdf files which has text in each page with id and page numbers as "Page X of Y". I am in need of splitting one pdf file into multiple pdf files based on Page X of Y text. I am trying to do POC using iText but I am struggling to read Page X of Y to identify the page numbers which I need to use to split the file. May I get some light on implementing this using Java?
I tried the below code:
public static void main(String args[]) {
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File("C:\\basics\\outbound\\FPPStmts.pdf");
try {
// PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead
RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
PDFParser parser = new PDFParser(randomAccessFile);
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(2);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
This is resulting me blank text though my pdf is having data.
I am having a pdf with barcodes. As shown in the image
I am stuck on an issue and not able to proceed with my project. I am using PDFBox for parsing of the PDF and able to convert the whole pdf in the text format as shown in the code below:
public static PdfValues readPdf() throws IOException {
System.out.println("Main Method Started");
File file = new File("D:/po/temp/output.pdf");
PDDocument document = PDDocument.load(file);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
text = text.trim();
text = text.replaceAll(" +", " ");
text = text.replaceAll("(?m)^[ \t]*\r?\n", "");
// System.out.println(text);
deleteIfExist();
writeToFile(text);
PdfValues infos = readData();
document.close();
System.out.println("Main Method Ended");
return infos;
}
But I am not getting the barcode value which means it is not a text. Can anyone help me how to parse this barcode values as an image or actual value? Thank you for reading this question.