I'm trying to extract all the images from pdf by using the below code, it work fine for all images except the images with color space indexed.
try (final PDDocument document = PDDocument.load(new File("./pdfs/22.pdf"))){
PDPageTree list = document.getPages();
for (PDPage page : list) {
PDResources pdResources = page.getResources();
int i = 1;
for (COSName name : pdResources.getXObjectNames()) {
PDXObject o = pdResources.getXObject(name);
if (o instanceof PDImageXObject) {
PDImageXObject image = (PDImageXObject)o;
String filename = OUTPUT_DIR + "extracted-image-" + i + ".png";
ImageIO.write(image.getImage(), "png", new File(filename));
i++;
}
}
}
} catch (IOException e){
System.err.println("Exception while trying to create pdf document - " + e);
}
Do i miss something? How can I extract such type of images??
Related
I ran into a very tough issue. We have forms that were supposed to be filled out, but some people used annotation freeform text comments in foxit instead of filling the form fields, so the annotations never flatten. When our render software generates the final document annotations are not included.
The solution I tried is to basically go through the document, get the annotation text content and write it to the pdf so it is on the final document then remove the actual annotation, but I run into an issue where I don't know the font the annotation is using, line space, etc so cannot find out how to get it from a pdfbox to recreate exacactly as the annotation looks on the unflattened form.
Basically I want to flatten annotatations that are freeform created in foxit (The typewriter comment feature)
Here is the code. It is working, but again I am struggling with figuring out how to get the annotations to write to my final pdf document. Again flatten on the acroform is not working because these are not acroform fields! The live code filters out anything that is not a freetext type annotation, but below code should show my issue.
public static void main(String [] args)
{
String startDoc = "C:/test2/test.pdf";
String finalFlat = "C:/test2/test_FLAT.pdf";
try {
// for testing
try {
//BasicConfigurator.configure();
File myFile = new File(startDoc);
PDDocument pdDoc = PDDocument.load( myFile );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
// set the NeedApperances flag
pdAcroForm.setNeedAppearances(false);
// correct the missing page link for the annotations
for (PDPage page : pdDoc.getPages()) {
for (PDAnnotation annot : page.getAnnotations()) {
System.out.println(annot.getContents());
System.out.println(annot.isPrinted());
System.out.println(annot.isLocked());
System.out.println(annot.getAppearance().toString());
PDPageContentStream contentStream = new PDPageContentStream(pdDoc, page, PDPageContentStream.AppendMode.APPEND,true,true);
int fontHeight = 14;
contentStream.setFont(PDType1Font.TIMES_ROMAN, fontHeight);
float height = annot.getRectangle().getLowerLeftY();
String s = annot.getContents().replaceAll("\t", " ");
String ss[] = s.split("\\r");
for(String sss : ss)
{
contentStream.beginText();
contentStream.newLineAtOffset(annot.getRectangle().getLowerLeftX(),height );
contentStream.showText(sss);
height = height + fontHeight * 2 ;
contentStream.endText();
}
contentStream.close();
page.getAnnotations().remove(annot);
}
}
pdAcroForm.flatten();
pdDoc.save(finalFlat);
pdDoc.close();
}
catch (Exception e) {
e.printStackTrace();
}
}
catch (Exception e) {
System.err.println("Exception: " + e.getLocalizedMessage());
}
}
This was not a fun one. After a million different tests, and I STILL do not understand all the nuances, but this is the version that appeas to flatten all pdf files and annotations if they are visible on PDF. Tested about half a dozen pdf creators and if an annotation is visible on a page this hopefully flattens it. I suspect there is a better way by pulling the matrix and transforming it and what not, but this is the only way I got it to work everywhere.
public static void flattenv3(String startDoc, String endDoc) {
org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.INFO);
String finalFlat = endDoc;
try {
try {
//BasicConfigurator.configure();
File myFile = new File(startDoc);
PDDocument pdDoc = PDDocument.load(myFile);
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
if (pdAcroForm != null) {
pdAcroForm.setNeedAppearances(false);
pdAcroForm.flatten();
}
// set the NeedApperances flag
boolean isContentStreamWrapped;
int ii = 0;
for (PDPage page: pdDoc.getPages()) {
PDPageContentStream contentStream;
isContentStreamWrapped = false;
List < PDAnnotation > annotations = new ArrayList < > ();
for (PDAnnotation annotation: page.getAnnotations()) {
if (!annotation.isInvisible() && !annotation.isHidden() && annotation.getNormalAppearanceStream() != null)
{
ii++;
if (ii > 1) {
// contentStream.close();
// continue;
}
if (!isContentStreamWrapped) {
contentStream = new PDPageContentStream(pdDoc, page, AppendMode.APPEND, true, true);
isContentStreamWrapped = true;
} else {
contentStream = new PDPageContentStream(pdDoc, page, AppendMode.APPEND, true);
}
PDAppearanceStream appearanceStream = annotation.getNormalAppearanceStream();
PDFormXObject fieldObject = new PDFormXObject(appearanceStream.getCOSObject());
contentStream.saveGraphicsState();
boolean needsTranslation = resolveNeedsTranslation(appearanceStream);
Matrix transformationMatrix = new Matrix();
boolean transformed = false;
float lowerLeftX = annotation.getNormalAppearanceStream().getBBox().getLowerLeftX();
float lowerLeftY = annotation.getNormalAppearanceStream().getBBox().getLowerLeftY();
PDRectangle bbox = appearanceStream.getBBox();
PDRectangle fieldRect = annotation.getRectangle();
float xScale = fieldRect.getWidth() - bbox.getWidth();
transformed = true;
lowerLeftX = fieldRect.getLowerLeftX();
lowerLeftY = fieldRect.getLowerLeftY();
if (bbox.getLowerLeftX() <= 0 && bbox.getLowerLeftY() < 0 && Math.abs(xScale) < 1) //BASICALLY EQUAL TO 0 WITH ROUNDING
{
lowerLeftY = fieldRect.getLowerLeftY() - bbox.getLowerLeftY();
if (bbox.getLowerLeftX() < 0 && bbox.getLowerLeftY() < 0) //THis is for the o
{
lowerLeftX = lowerLeftX - bbox.getLowerLeftX();
}
} else if (bbox.getLowerLeftX() == 0 && bbox.getLowerLeftY() < 0 && xScale >= 0) {
lowerLeftX = fieldRect.getUpperRightX();
} else if (bbox.getLowerLeftY() <= 0 && xScale >= 0) {
lowerLeftY = fieldRect.getLowerLeftY() - bbox.getLowerLeftY() - xScale;
} else if (bbox.getUpperRightY() <= 0) {
if (annotation.getNormalAppearanceStream().getMatrix().getShearY() < 0) {
lowerLeftY = fieldRect.getUpperRightY();
lowerLeftX = fieldRect.getUpperRightX();
}
} else {
}
transformationMatrix.translate(lowerLeftX,
lowerLeftY);
contentStream.transform(transformationMatrix);
contentStream.drawForm(fieldObject);
contentStream.restoreGraphicsState();
contentStream.close();
}
}
page.setAnnotations(annotations);
}
pdDoc.save(finalFlat);
pdDoc.close();
File file = new File(finalFlat);
// Desktop.getDesktop().browse(file.toURI());
} catch (Exception e) {
e.printStackTrace();
}
} catch (Exception e) {
System.err.println("Exception: " + e.getLocalizedMessage());
}
}
}
Requirement:
My pdf has 5 pages, I need to start the split on the second page and end at the last page, in a single pdf. Currently, I have made it so that it splits one pdf per page.
Current code:
public static boolean SepararFC(String sequence, String pathfrom, String pathto) {
try (PDDocument document = PDDocument.load(new File(pathfrom))) {
Splitter splitter = new Splitter();
List<PDDocument> Pages = splitter.split(document);
for (int i = 0; i < Pages.size(); i++) {
PDDocument doc = Pages.get(i);
doc.save(new File(pathfrom, sequence + "_" + i + ".pdf"));
}
} catch (IOException e){
System.err.println(e);
}
return true;
Since you don't want page 1, you can just remove it and save a new file
try (PDDocument document = PDDocument.load(new File(pathfrom))) {
//Listing the number of existing pages
int noOfPages= document.getNumberOfPages();
System.out.print(noOfPages);
//Removing the pages
document.removePage(0);
System.out.println("page removed");
//Saving the document
document.save(new File(pathfrom, sequence + "new"+ ".pdf"));
//Closing the document
document.close();}
catch (IOException e){
System.err.println(e);
}
Assuming your code is correct:
public static boolean SepararFC(String sequence, String pathfrom, String pathto) {
int start = 1;
int end = 4;
try (PDDocument document = PDDocument.load(new File(pathfrom))) {
Splitter splitter = new Splitter();
List<PDDocument> Pages = splitter.split(document);
for (int i = 0; i < Pages.size(); i++) {
PDDocument doc = Pages.get(i);
if(i > start && i <= end){
doc.save(new File(pathfrom, sequence + "_" + i + ".pdf"));
}
}
} catch (IOException e){
System.err.println(e);
}
return true;
I have PDF document with some images and one placeholder where I have to put image. My problem is that the image always has size of placeholder when i run program. Where I am wrong?
Image size is (320x120).
PDDocument pdfDocument = PDDocument.load(pdfDocumentBase64);
PDImageXObject editObject =LosslessFactory.createFromImage(pdfDocument, signatureImage);
PDPageTree list = pdfDocument.getPages();
pdfDocument.getNumberOfPages();
for (PDPage page : list) {
PDResources pdResources = page.getResources();
for (COSName name : pdResources.getXObjectNames()) {
PDImageXObject o = (PDImageXObject) pdResources.getXObject(name);
if (o instanceof PDImageXObject && o.getHeight() < 100) {
int a = o.getWidth(); // 10 --- placeholder
int b = o.getHeight(); // 10 ---- placeholder
pdResources.put(name,editObject);
}
}
}
I'm using Apache PDFBox to extract pages from PDF files and I can't find a way to extract content that is unselectable (either text or images). With content that is selectable from within the PDF files there is no problem.
Note that the PDFs in question dont have any restrictions regarding copying content, at least from what I saw on the files's "Document Restrictions Summary": they all have "Content Copying" and "Content Copying for Accessbility" allowed! On the same PDF file there is content that is selectable and other parts that aren't. What happens is that, the extracted pages come with "holes", i.e., they only have the selectable parts of the PDF. On MS Word though, if I add the PDFs as objects, the whole content of the PDF pages appear! So I was hoping to do the same with PDFBox lib or any other Java lib for that matter!
Here is the code I'm using to convert PDF pages to images:
private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
for (PDPage pdPage : pdPages) {
BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + ".png", 300);
}
document.close();
}
Is there a way to extract unselectable content from an PDF with this Apache PDFBox library (or with any of the other similar libraries)? Or this is not possible at all? And if indeed it's not, why?
Much appreciated for any help!
EDIT: I'm using Adobe Reader as PDF viewer and PDFBox v1.8. Here is a sample PDF: https://dl.dropboxusercontent.com/u/2815529/test.pdf
The two images in question, the fischer logo in the upper right and the small sketch a bit down, are each drawn by filling a region on the page with a tiling pattern which in turn in its content stream draws the respective image.
Adobe Reader does not allow to select contents of patterns, and automatic image extractors often do not walk the Pattern resource tree either.
PDFBox 1.8.10
You can use PDFBox to fairly easily build a pattern image extractor, e.g. for PDFBox 1.8.10:
public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
List<PDPage> pages = document.getDocumentCatalog().getAllPages();
if (pages == null)
return;
for (int i = 0; i < pages.size(); i++)
{
String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
extractPatternImages(pages.get(i), pageFormat);
}
}
public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
PDResources resources = page.getResources();
if (resources == null)
return;
Map<String, PDPatternResources> patterns = resources.getPatterns();
for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
{
String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
extractPatternImages(patternEntry.getValue(), patternFormat);
}
}
public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException
{
COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
if (resourcesDict == null)
return;
PDResources resources = new PDResources(resourcesDict);
Map<String, PDXObject> xObjects = resources.getXObjects();
if (xObjects == null)
return;
for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
{
PDXObject xObject = entry.getValue();
String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
if (xObject instanceof PDXObjectForm)
extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
else if (xObject instanceof PDXObjectImage)
extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
}
}
public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException
{
PDResources resources = form.getResources();
if (resources == null)
return;
Map<String, PDXObject> xObjects = resources.getXObjects();
if (xObjects == null)
return;
for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
{
PDXObject xObject = entry.getValue();
String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
if (xObject instanceof PDXObjectForm)
extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
else if (xObject instanceof PDXObjectImage)
extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
}
Map<String, PDPatternResources> patterns = resources.getPatterns();
for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
{
String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
extractPatternImages(patternEntry.getValue(), patternFormat);
}
}
public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException
{
image.write2OutputStream(new FileOutputStream(String.format(imageFormat, "", image.getSuffix())));
}
(ExtractPatternImages.java)
I applied it to your sample PDF like this
public void testtestDrJorge() throws IOException
{
try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf"))
{
PDDocument document = PDDocument.load(resource);
extractPatternImages(document, "testDrJorge%s.%s");;
}
}
(ExtractPatternImages.java)
and got two images:
`testDrJorge-0-R15-R14.png
testDrJorge-0-R38-R37.png
The images have lost their red parts. This most likely is dues to the fact that PDFBox version 1.x.x do not properly support extraction of CMYK images, cf. PDFBOX-2128 (CMYK images are not supported correctly), and your images are in CMYK.
PDFBox 2.0.0 release candidate
I updated the code to PDFBox 2.0.0 (currently available as release candidate only):
public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
PDPageTree pages = document.getDocumentCatalog().getPages();
if (pages == null)
return;
for (int i = 0; i < pages.getCount(); i++)
{
String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
extractPatternImages(pages.get(i), pageFormat);
}
}
public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
PDResources resources = page.getResources();
if (resources == null)
return;
Iterable<COSName> patternNames = resources.getPatternNames();
for (COSName patternName : patternNames)
{
String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
extractPatternImages(resources.getPattern(patternName), patternFormat);
}
}
public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException
{
COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
if (resourcesDict == null)
return;
PDResources resources = new PDResources(resourcesDict);
Iterable<COSName> xObjectNames = resources.getXObjectNames();
if (xObjectNames == null)
return;
for (COSName xObjectName : xObjectNames)
{
PDXObject xObject = resources.getXObject(xObjectName);
String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
if (xObject instanceof PDFormXObject)
extractPatternImages((PDFormXObject)xObject, xObjectFormat);
else if (xObject instanceof PDImageXObject)
extractPatternImages((PDImageXObject)xObject, xObjectFormat);
}
}
public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException
{
PDResources resources = form.getResources();
if (resources == null)
return;
Iterable<COSName> xObjectNames = resources.getXObjectNames();
if (xObjectNames == null)
return;
for (COSName xObjectName : xObjectNames)
{
PDXObject xObject = resources.getXObject(xObjectName);
String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
if (xObject instanceof PDFormXObject)
extractPatternImages((PDFormXObject)xObject, xObjectFormat);
else if (xObject instanceof PDImageXObject)
extractPatternImages((PDImageXObject)xObject, xObjectFormat);
}
Iterable<COSName> patternNames = resources.getPatternNames();
for (COSName patternName : patternNames)
{
String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
extractPatternImages(resources.getPattern(patternName), patternFormat);
}
}
public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException
{
String filename = String.format(imageFormat, "", image.getSuffix());
ImageIOUtil.writeImage(image.getOpaqueImage(), "png", new FileOutputStream(filename));
}
and get
testDrJorge-0-COSName{R15}-COSName{R14}.png
testDrJorge-0-COSName{R38}-COSName{R37}.png
Looks like an improvement... ;)
I am using iText-Java to split PDFs at bookmark level.
Does anybody know or have any examples for splitting a PDF at bookmarks that exist at a level 2 or 3?
For ex: I have the bookmarks in the following levels:
Father
|-Son
|-Son
|-Daughter
|-|-Grand son
|-|-Grand daughter
Right now I have below code to read the bookmark which reads the base bookmark(Father). Basically SimpleBookmark.getBookmark(reader) line did all the work.
But I want to read the level 2 and level 3 bookmarks to split the content present between those inner level bookmarks.
public static void splitPDFByBookmarks(String pdf, String outputFolder){
try
{
PdfReader reader = new PdfReader(pdf);
//List of bookmarks: each bookmark is a map with values for title, page, etc
List<HashMap> bookmarks = SimpleBookmark.getBookmark(reader);
for(int i=0; i<bookmarks.size(); i++){
HashMap bm = bookmarks.get(i);
HashMap nextBM = i==bookmarks.size()-1 ? null : bookmarks.get(i+1);
//In my case I needed to split the title string
String title = ((String)bm.get("Title")).split(" ")[2];
log.debug("Titel: " + title);
String startPage = ((String)bm.get("Page")).split(" ")[0];
String startPageNextBM = nextBM==null ? "" + (reader.getNumberOfPages() + 1) : ((String)nextBM.get("Page")).split(" ")[0];
log.debug("Page: " + startPage);
log.debug("------------------");
extractBookmarkToPDF(reader, Integer.valueOf(startPage), Integer.valueOf(startPageNextBM), title + ".pdf",outputFolder);
}
}
catch (IOException e)
{
log.error(e.getMessage());
}
}
private static void extractBookmarkToPDF(PdfReader reader, int pageFrom, int pageTo, String outputName, String outputFolder){
Document document = new Document();
OutputStream os = null;
try{
os = new FileOutputStream(outputFolder + outputName);
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, os);
document.open();
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF data
PdfImportedPage page;
while(pageFrom < pageTo) {
document.newPage();
page = writer.getImportedPage(reader, pageFrom);
cb.addTemplate(page, 0, 0);
pageFrom++;
}
os.flush();
document.close();
os.close();
}catch(Exception ex){
log.error(ex.getMessage());
}finally {
if (document.isOpen())
document.close();
try {
if (os != null)
os.close();
} catch (IOException ioe) {
log.error(ioe.getMessage());
}
}
}
Your help is much appreciated.
Thanks in advance! :)
You get an ArrayList<HashMap> when you call SimpleBookmark.getBookmark(reader); (do the cast if you need it). Try to iterate through that Arraylist and see its structure. If a bookmarks have sons (as you call it), it will contains another list with the same structure.
A recursive method could be the solution.
Reference for those who are looking at this using itext7
public void walkOutlines(PdfOutline outline, Map<String, PdfObject> names, PdfDocument pdfDocument,List<String>titles,List<Integer>pageNum) { //----------loop traversing all paths
for (PdfOutline child : outline.getAllChildren()){
if(child.getDestination() != null) {
prepareIndexFile(child,names,pdfDocument,titles,pageNum,list);
}
}
}
//-----Getting pageNumbers from outlines
public void prepareIndexFile(PdfOutline outline, Map<String, PdfObject> names, PdfDocument pdfDocument,List<String>titles,List<Integer>pageNum) {
String title = outline.getTitle();
PdfDestination pdfDestination = outline.getDestination();
String pdfStr = ((PdfString)pdfDestination.getPdfObject()).toUnicodeString();
PdfArray array = (PdfArray) names.get(pdfStr);
PdfObject pdfObj = array != null ? array.get(0) : null;
Integer pageNumber = pdfDocument.getPageNumber((PdfDictionary)pdfObj);
titles.add(title);
pageNum.add(pageNumber);
if(outline.getAllChildren().size() > 0) {
for (PdfOutline child : outline.getAllChildren()){
prepareIndexFile(child,names,pdfDocument,titles,pageNum);
}
}
}
public boolean splitPdf(String inputFile, final String outputFolder) {
boolean splitSuccess = true;
PdfDocument pdfDoc = null;
try {
PdfReader pdfReaderNew = new PdfReader(inputFile);
pdfDoc = new PdfDocument(pdfReaderNew);
final List<String> titles = new ArrayList<String>();
List<Integer> pageNum = new ArrayList<Integer>();
PdfNameTree destsTree = pdfDoc.getCatalog().getNameTree(PdfName.Dests);
Map<String, PdfObject> names = destsTree.getNames();//--------------------------------------Core logic for getting names
PdfOutline root = pdfDoc.getOutlines(false);//--------------------------------------Core logic for getting outlines
walkOutlines(root,names, pdfDoc, titles, pageNum,content); //------Logic to get bookmarks and pageNumbers
if (titles == null || titles.size()==0) {
splitSuccess = false;
}else { //------Proceed if it has bookmarks
for(int i=0;i<titles.size();i++) {
String title = titles.get(i);
String startPageNmStr =""+pageNum.get(i);
int startPage = Integer.parseInt(startPageNmStr);
int endPage = startPage;
if(i == titles.size() - 1) {
endPage = pdfDoc.getNumberOfPages();
}else {
int nextPage = pageNum.get(i+1);
if(nextPage > startPage) {
endPage = nextPage - 1;
}else {
endPage = nextPage;
}
}
String outFileName = outputFolder + File.separator + getFileName(title) + ".pdf";
PdfWriter pdfWriter = new PdfWriter(outFileName);
PdfDocument newDocument = new PdfDocument(pdfWriter, new DocumentProperties().setEventCountingMetaInfo(null));
pdfDoc.copyPagesTo(startPage, endPage, newDocument);
newDocument.close();
pdfWriter.close();
}
}
}catch(Exception e){
//---log
}
}