Splitting a large PDF file with PDFBox produces large result files

I am processing some large PDF files (up to 100 MB and about 2000 pages) with PDFBox. Some of the pages contain a QR code, and I want to split each file into smaller ones, each holding the pages from one QR code up to the next.
I got this working, but each result file is as large as the source file: if I cut a 100 MB PDF into ten files, I get ten files of 100 MB each.
This is the code:
PDDocument documentoPdf =
PDDocument.loadNonSeq(new File("myFile.pdf"),
new RandomAccessFile(new File("./tmp/temp"), "rw"));
int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();
int previusQR = 0;
for(int i =0; i<numPages; i++){
PDPage page = (PDPage) pages.get(i);
BufferedImage firstPageImage =
page.convertToImage(BufferedImage.TYPE_USHORT_565_RGB , 200);
String qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
if(qrText != null && i!=0){
PDDocument outputDocument = new PDDocument();
for(int j = previusQR; j<i; j++){
outputDocument.importPage((PDPage)pages.get(j));
}
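File f = new File("./splitting_files/"+previusQR+".pdf");
outputDocument.save(f);
outputDocument.close();
previusQR = i; // the next segment starts at this QR page
}
}
documentoPdf.close(); // close the source only after all pages have been processed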
I also tried the following code for storing the new file:
PDDocument outputDocument = new PDDocument();
for(int j = previusQR; j<i; j++){
PDStream src = ((PDPage)pages.get(j)).getContents();
PDStream streamD = new PDStream(outputDocument);
streamD.addCompression();
PDPage newPage = new PDPage(new
COSDictionary(((PDPage)pages.get(j)).getCOSDictionary()));
newPage.setContents(streamD);
byte[] buf = new byte[10240];
int amountRead = 0;
InputStream is = null;
OutputStream os = null;
is = src.createInputStream();
os = streamD.createOutputStream();
while((amountRead = is.read(buf,0,10240)) > -1) {
os.write(buf, 0, amountRead);
}
outputDocument.addPage(newPage);
}
File f = new File("./splitting_files/"+previusQR+".pdf");
outputDocument.save(f);
outputDocument.close();
But this code creates files that lack some content and are still the same size as the original.
How can I create smaller PDF files from a larger one?
Is it possible with PDFBox? Is there any other library that can render a single page to an image (for QR recognition) and also split a big PDF file into smaller ones?
Thanks!

Thanks! Tilman, you are right: the PDFSplit command generates smaller files. I checked the PDFSplit code and found that it removes the page references from links and other annotations, so that resources the split file does not need are not pulled in.
Code extracted from Splitter.class:
private void processAnnotations(PDPage imported) throws IOException
{
List<PDAnnotation> annotations = imported.getAnnotations();
for (PDAnnotation annotation : annotations)
{
if (annotation instanceof PDAnnotationLink)
{
PDAnnotationLink link = (PDAnnotationLink)annotation;
PDDestination destination = link.getDestination();
if (destination == null && link.getAction() != null)
{
PDAction action = link.getAction();
if (action instanceof PDActionGoTo)
{
destination = ((PDActionGoTo)action).getDestination();
}
}
if (destination instanceof PDPageDestination)
{
// TODO preserve links to pages within the splitted result
((PDPageDestination) destination).setPage(null);
}
}
else
{
// TODO preserve links to pages within the splitted result
annotation.setPage(null);
}
}
}
So eventually my code looks like this:
PDDocument documentoPdf =
PDDocument.loadNonSeq(new File("docs_compuestos/50.pdf"), new RandomAccessFile(new File("./tmp/t"), "rw"));
int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();
int previusQR = 0;
for(int i =0; i<numPages; i++){
PDPage firstPage = (PDPage) pages.get(i);
String qrText = null; // stays null when no QR code is found on this page
BufferedImage firstPageImage = firstPage.convertToImage(BufferedImage.TYPE_USHORT_565_RGB , 200);
firstPage =null;
try {
qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
} catch (NotFoundException e) {
e.printStackTrace();
} finally {
firstPageImage = null;
}
if(i != 0 && qrText!=null){
PDDocument outputDocument = new PDDocument();
outputDocument.setDocumentInformation(documentoPdf.getDocumentInformation());
outputDocument.getDocumentCatalog().setViewerPreferences(
documentoPdf.getDocumentCatalog().getViewerPreferences());
for(int j = previusQR; j<i; j++){
PDPage importedPage = outputDocument.importPage((PDPage)pages.get(j));
importedPage.setCropBox( ((PDPage)pages.get(j)).findCropBox() );
importedPage.setMediaBox( ((PDPage)pages.get(j)).findMediaBox() );
// only the resources of the page will be copied
importedPage.setResources( ((PDPage)pages.get(j)).getResources() );
importedPage.setRotation( ((PDPage)pages.get(j)).findRotation() );
processAnnotations(importedPage);
}
File f = new File("./splitting_files/"+previusQR+".pdf");
previusQR = i;
outputDocument.save(f);
outputDocument.close();
}
}
}
Thank you very much!!
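
The page-range bookkeeping in the loop above can be kept separate from the PDF handling, which makes it easier to check on its own. The following is a minimal sketch and not part of the original answer: the class `QrSegments` and the `hasQr` array (one flag per page, true where a QR code was decoded) are hypothetical names used only for illustration. Unlike the loop above, it also returns the final segment, which runs from the last QR page to the end of the document.

```java
import java.util.ArrayList;
import java.util.List;

public class QrSegments {
    /**
     * Returns {start, endExclusive} page-index pairs. A new segment starts at
     * every page that carries a QR code; page 0 always starts the first segment.
     */
    public static List<int[]> segments(boolean[] hasQr) {
        List<int[]> result = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < hasQr.length; i++) {
            if (hasQr[i]) {
                result.add(new int[] {start, i});
                start = i;
            }
        }
        // Trailing segment: pages from the last QR code to the end of the file.
        if (hasQr.length > 0) {
            result.add(new int[] {start, hasQr.length});
        }
        return result;
    }
}
```

Each returned pair can drive the inner `for(int j = previusQR; j<i; j++)` import loop directly, with `start` in place of `previusQR` and `endExclusive` in place of `i`.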

Related

PDFBOX 2.0+ java flatten annotations freetext created by foxit

I ran into a very tough issue. We have forms that were supposed to be filled out, but some people used free-text annotation comments in Foxit instead of filling in the form fields, so the annotations never get flattened. When our rendering software generates the final document, the annotations are not included.
The solution I tried is to go through the document, get each annotation's text content, write it onto the page so it appears in the final document, and then remove the annotation itself. The problem is that I don't know the font, line spacing, etc. that the annotation uses, so I cannot work out how to get them from PDFBox and recreate the text exactly as the annotation looks on the unflattened form.
Basically, I want to flatten free-text annotations created in Foxit (the typewriter comment feature).
Here is the code. It works, but I am struggling to figure out how to get the annotations written to my final PDF document. Flattening the AcroForm does not help because these are not AcroForm fields! The live code filters out anything that is not a free-text annotation, but the code below should show my issue.
public static void main(String [] args)
{
String startDoc = "C:/test2/test.pdf";
String finalFlat = "C:/test2/test_FLAT.pdf";
try {
// for testing
try {
//BasicConfigurator.configure();
File myFile = new File(startDoc);
PDDocument pdDoc = PDDocument.load( myFile );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
// set the NeedApperances flag
pdAcroForm.setNeedAppearances(false);
// correct the missing page link for the annotations
for (PDPage page : pdDoc.getPages()) {
for (PDAnnotation annot : page.getAnnotations()) {
System.out.println(annot.getContents());
System.out.println(annot.isPrinted());
System.out.println(annot.isLocked());
System.out.println(annot.getAppearance().toString());
PDPageContentStream contentStream = new PDPageContentStream(pdDoc, page, PDPageContentStream.AppendMode.APPEND,true,true);
int fontHeight = 14;
contentStream.setFont(PDType1Font.TIMES_ROMAN, fontHeight);
float height = annot.getRectangle().getLowerLeftY();
String s = annot.getContents().replaceAll("\t", " ");
String ss[] = s.split("\\r");
for(String sss : ss)
{
contentStream.beginText();
contentStream.newLineAtOffset(annot.getRectangle().getLowerLeftX(),height );
contentStream.showText(sss);
height = height + fontHeight * 2 ;
contentStream.endText();
}
contentStream.close();
page.getAnnotations().remove(annot);
}
}
pdAcroForm.flatten();
pdDoc.save(finalFlat);
pdDoc.close();
}
catch (Exception e) {
e.printStackTrace();
}
}
catch (Exception e) {
System.err.println("Exception: " + e.getLocalizedMessage());
}
}

This was not a fun one. After a million different tests I STILL do not understand all the nuances, but this is the version that appears to flatten all PDF files and annotations that are visible in the PDF. I tested PDFs from about half a dozen creators, and if an annotation is visible on a page, this hopefully flattens it. I suspect there is a better way, by pulling the matrix and transforming it and whatnot, but this is the only way I got it to work everywhere.
public static void flattenv3(String startDoc, String endDoc) {
org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.INFO);
String finalFlat = endDoc;
try {
try {
//BasicConfigurator.configure();
File myFile = new File(startDoc);
PDDocument pdDoc = PDDocument.load(myFile);
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
if (pdAcroForm != null) {
pdAcroForm.setNeedAppearances(false);
pdAcroForm.flatten();
}
// set the NeedApperances flag
boolean isContentStreamWrapped;
int ii = 0;
for (PDPage page: pdDoc.getPages()) {
PDPageContentStream contentStream;
isContentStreamWrapped = false;
List < PDAnnotation > annotations = new ArrayList < > ();
for (PDAnnotation annotation: page.getAnnotations()) {
if (!annotation.isInvisible() && !annotation.isHidden() && annotation.getNormalAppearanceStream() != null)
{
ii++;
if (ii > 1) {
// contentStream.close();
// continue;
}
if (!isContentStreamWrapped) {
contentStream = new PDPageContentStream(pdDoc, page, AppendMode.APPEND, true, true);
isContentStreamWrapped = true;
} else {
contentStream = new PDPageContentStream(pdDoc, page, AppendMode.APPEND, true);
}
PDAppearanceStream appearanceStream = annotation.getNormalAppearanceStream();
PDFormXObject fieldObject = new PDFormXObject(appearanceStream.getCOSObject());
contentStream.saveGraphicsState();
boolean needsTranslation = resolveNeedsTranslation(appearanceStream);
Matrix transformationMatrix = new Matrix();
boolean transformed = false;
float lowerLeftX = annotation.getNormalAppearanceStream().getBBox().getLowerLeftX();
float lowerLeftY = annotation.getNormalAppearanceStream().getBBox().getLowerLeftY();
PDRectangle bbox = appearanceStream.getBBox();
PDRectangle fieldRect = annotation.getRectangle();
float xScale = fieldRect.getWidth() - bbox.getWidth();
transformed = true;
lowerLeftX = fieldRect.getLowerLeftX();
lowerLeftY = fieldRect.getLowerLeftY();
if (bbox.getLowerLeftX() <= 0 && bbox.getLowerLeftY() < 0 && Math.abs(xScale) < 1) //BASICALLY EQUAL TO 0 WITH ROUNDING
{
lowerLeftY = fieldRect.getLowerLeftY() - bbox.getLowerLeftY();
if (bbox.getLowerLeftX() < 0 && bbox.getLowerLeftY() < 0) //THis is for the o
{
lowerLeftX = lowerLeftX - bbox.getLowerLeftX();
}
} else if (bbox.getLowerLeftX() == 0 && bbox.getLowerLeftY() < 0 && xScale >= 0) {
lowerLeftX = fieldRect.getUpperRightX();
} else if (bbox.getLowerLeftY() <= 0 && xScale >= 0) {
lowerLeftY = fieldRect.getLowerLeftY() - bbox.getLowerLeftY() - xScale;
} else if (bbox.getUpperRightY() <= 0) {
if (annotation.getNormalAppearanceStream().getMatrix().getShearY() < 0) {
lowerLeftY = fieldRect.getUpperRightY();
lowerLeftX = fieldRect.getUpperRightX();
}
} else {
}
transformationMatrix.translate(lowerLeftX,
lowerLeftY);
contentStream.transform(transformationMatrix);
contentStream.drawForm(fieldObject);
contentStream.restoreGraphicsState();
contentStream.close();
}
}
page.setAnnotations(annotations);
}
pdDoc.save(finalFlat);
pdDoc.close();
File file = new File(finalFlat);
// Desktop.getDesktop().browse(file.toURI());
} catch (Exception e) {
e.printStackTrace();
}
} catch (Exception e) {
System.err.println("Exception: " + e.getLocalizedMessage());
}
}
}

PDFBox: put two A4 pages on one A3

I have a PDF document with one or more A4 pages.
The resulting PDF document should be A3, with each page containing two pages of the source document (odd on the left, even on the right).
I already managed to render the A4 pages into images, and the odd pages are successfully placed on the first half of the new A3 pages, but I cannot get the even pages placed.
public class CreateLandscapePDF {
public void renderPDF(File inputFile, String output) {
PDDocument docIn = null;
PDDocument docOut = null;
float width = 0;
float height = 0;
float posX = 0;
float posY = 0;
try {
docIn = PDDocument.load(inputFile);
PDFRenderer pdfRenderer = new PDFRenderer(docIn);
docOut = new PDDocument();
int pageCounter = 0;
for(PDPage pageIn : docIn.getPages()) {
pageIn.setRotation(270);
BufferedImage bufferedImage = pdfRenderer.renderImage(pageCounter);
width = bufferedImage.getHeight();
height = bufferedImage.getWidth();
PDPage pageOut = new PDPage(PDRectangle.A3);
PDImageXObject image = LosslessFactory.createFromImage(docOut, bufferedImage);
PDPageContentStream contentStream = new PDPageContentStream(docOut, pageOut, AppendMode.APPEND, true, true);
if((pageCounter & 1) == 0) {
pageOut.setRotation(90);
docOut.addPage(pageOut);
posX = 0;
posY = 0;
} else {
posX = 0;
posY = width;
}
contentStream.drawImage(image, posX, posY);
contentStream.close();
bufferedImage.flush();
pageCounter++;
}
docOut.save(output + "\\LandscapeTest.pdf");
docOut.close();
docIn.close();
} catch(IOException io) {
io.printStackTrace();
}
}
}
I'm using Apache PDFBox 2.0.2 (pdfbox-app-2.0.2.jar)
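
Reading the code above, a new `pageOut` is created for every input page but only added to `docOut` when the counter is even, so the image for an odd counter appears to be drawn onto a page that never becomes part of the output. Independent of the library, the placement itself is simple arithmetic. The following is a minimal sketch with a hypothetical class `TwoUpLayout` (not a PDFBox API): two consecutive pages share one sheet, and the second one is shifted right by half the sheet width. Sizes are in PDF points (1/72 inch), computed from the ISO 216 millimetre dimensions.

```java
public class TwoUpLayout {
    private static final float POINTS_PER_MM = 72f / 25.4f;

    /** Width of a landscape A3 sheet (420 mm) in PDF points. */
    public static final float A3_LANDSCAPE_WIDTH = 420f * POINTS_PER_MM;

    /** Zero-based index of the output sheet that receives the given zero-based source page. */
    public static int sheetIndex(int pageIndex) {
        return pageIndex / 2;
    }

    /** Horizontal offset on the sheet: left half for the first page of a pair, right half for the second. */
    public static float xOffset(int pageIndex) {
        return (pageIndex % 2 == 0) ? 0f : A3_LANDSCAPE_WIDTH / 2f;
    }
}
```

With this split, an output page only needs to be created when `pageIndex` is even, and both pages of a pair are drawn onto that same page.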

Thank you very much for your help and the link to the other question. I think I had already read it but wasn't able to use it in my code yet.
In the end PDF Clown did the job, though I think it's not very nice to use PDFBox and PDF Clown in the same program.
Anyway, here's my working code to combine A4 pages on A3 paper.
public class CombinePages {
public void run(String input, String output) {
try {
Document source = new File(input).getDocument();
Pages sourcePages = source.getPages();
Document target = new File().getDocument();
Page targetPage = null;
int pageCounter = 0;
double moveByX = .0;
for(Page sourcePage : source.getPages()) {
if((pageCounter & 1) == 0) {
//even page gets a blank page
targetPage = new Page(target);
target.setPageSize(PageFormat.getSize(PageFormat.SizeEnum.A3, PageFormat.OrientationEnum.Landscape));
target.getPages().add(targetPage);
moveByX = .0;
} else {
moveByX = .50;
}
//get content from source page
XObject xObject = sourcePages.get(pageCounter).toXObject(target);
PrimitiveComposer composer = new PrimitiveComposer(targetPage);
Dimension2D targetSize = targetPage.getSize();
Dimension2D sourceSize = xObject.getSize();
composer.showXObject(xObject, new Point2D.Double(targetSize.getWidth() * moveByX, targetSize.getHeight() * .0), new Dimension(sourceSize.getWidth(), sourceSize.getHeight()), XAlignmentEnum.Left, YAlignmentEnum.Top, 0);
composer.flush();
pageCounter++;
}
target.getFile().save(output + "\\CombinePages.pdf", SerializationModeEnum.Standard);
source.getFile().close();
} catch (FileNotFoundException fnf) {
log.error(fnf);
} catch (IOException io) {
log.error(io);
}
}
}

How to find blank pages inside a PDF using PDFBox?

Here is the challenge I'm currently facing.
I have a lot of PDFs and I have to remove the blank pages inside them and display only the pages with content (text or images).
The problem is that those pdfs are scanned documents.
So the blank pages have some dirt left behind by the scanner.

I did some research and ended up with this code, which checks whether 99% of the page is white or light gray.
I needed the gray factor because scanned documents are sometimes not pure white.
private static Boolean isBlank(PDPage pdfPage) throws IOException {
BufferedImage bufferedImage = pdfPage.convertToImage();
long count = 0;
int height = bufferedImage.getHeight();
int width = bufferedImage.getWidth();
Double areaFactor = (width * height) * 0.99;
for (int x = 0; x < width ; x++) {
for (int y = 0; y < height ; y++) {
Color c = new Color(bufferedImage.getRGB(x, y));
// verify light gray and white
if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue()
&& c.getRed() >= 248) {
count++;
}
}
}
if (count >= areaFactor) {
return true;
}
return false;
}
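
Scanner output is often not perfectly neutral, so a pixel may be very light without having identical red, green and blue values. The following is a sketch of the same idea with the thresholds made explicit; the class `BlankPageCheck` and the parameters `minBrightness` and `maxChannelSpread` are hypothetical names introduced here, and the default values simply mirror the check above (brightness of at least 248, all channels equal, 99% of the page).

```java
import java.awt.image.BufferedImage;

public class BlankPageCheck {
    /**
     * Fraction (0.0 to 1.0) of pixels whose darkest channel is at least minBrightness
     * and whose channels differ by at most maxChannelSpread.
     */
    public static double lightFraction(BufferedImage image, int minBrightness, int maxChannelSpread) {
        int width = image.getWidth();
        int height = image.getHeight();
        long light = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int min = Math.min(r, Math.min(g, b));
                int max = Math.max(r, Math.max(g, b));
                if (min >= minBrightness && max - min <= maxChannelSpread) {
                    light++;
                }
            }
        }
        return (double) light / ((long) width * height);
    }

    /** Same rule as the method above: 99% of the pixels are pure gray with a value of 248 or more. */
    public static boolean isBlank(BufferedImage image) {
        return lightFraction(image, 248, 0) >= 0.99;
    }
}
```

Raising `maxChannelSpread` above 0 tolerates a slight colour cast; which values suit a given scanner is something to find by experiment.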

@Shoyo's code works fine for PDFBox versions before 2.0. For future readers: not much changes, but just in case, here is the code for PDFBox 2.0+ to make your life easier.
In your main method (by main, I mean the place where you load your PDF into a PDDocument):
try {
PDDocument document = PDDocument.load(new File("/home/codemantra/Downloads/tetml_ct_access/C.pdf"));
PDFRenderer renderedDoc = new PDFRenderer(document);
for (int pageNumber = 0; pageNumber < document.getNumberOfPages(); pageNumber++) {
if(isBlank(renderedDoc.renderImage(pageNumber))) {
System.out.println("Blank Page Number : " + (pageNumber + 1));
}
}
} catch (Exception e) {
e.printStackTrace();
}
And the isBlank method just takes a BufferedImage:
private static Boolean isBlank(BufferedImage pageImage) throws IOException {
BufferedImage bufferedImage = pageImage;
long count = 0;
int height = bufferedImage.getHeight();
int width = bufferedImage.getWidth();
Double areaFactor = (width * height) * 0.99;
for (int x = 0; x < width; x++) {
for (int y = 0; y < height; y++) {
Color c = new Color(bufferedImage.getRGB(x, y));
if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue() && c.getRed() >= 248) {
count++;
}
}
}
if (count >= areaFactor) {
return true;
}
return false;
}
All credit goes to @Shoyo.
Update:
Some PDFs have pages that say "This Page was Intentionally Left Blank", which the above code treats as blank. If that is what you need, feel free to use the above code. My requirement, however, was to filter out only the pages that are completely blank (no images and no fonts present). So I ended up using this code (which also runs faster :P):
public static void main(String[] args) {
try {
PDDocument document = PDDocument.load(new File("/home/codemantra/Downloads/CTP2040.pdf"));
PDPageTree allPages = document.getPages();
Integer pageNumber = 1;
for (PDPage page : allPages) {
Iterable<COSName> xObjects = page.getResources().getXObjectNames();
Iterable<COSName> fonts = page.getResources().getFontNames();
if(xObjects.spliterator().getExactSizeIfKnown() == 0 && fonts.spliterator().getExactSizeIfKnown() == 0) {
System.out.println(pageNumber);
}
pageNumber++;
}
} catch (Exception e) {
e.printStackTrace();
}
}
This will return the page numbers of those pages which are completely blank.
Hope this helps someone! :)

@Pramesh Bajracharya, your solution for finding blank pages in a PDF document works well!
If the requirement is to remove the blank pages, the same code can be extended as below:
List<Integer> blankPageList = new ArrayList<Integer>();
for( PDPage page : allPages )
{
Iterable<COSName> xObjects = page.getResources().getXObjectNames();
Iterable<COSName> fonts = page.getResources().getFontNames();
// condition to determine if the page is a blank page
if( xObjects.spliterator().getExactSizeIfKnown() == 0 && fonts.spliterator().getExactSizeIfKnown() == 0 )
{
blankPageList.add( pageNumber );
}
pageNumber++;
}
// remove the blank pages, last first, so that a removal does not shift the indices still to be removed;
// this assumes pageNumber starts at 1 as in the previous answer, while removePage expects a 0-based index
for( int k = blankPageList.size() - 1; k >= 0; k-- )
{
document.removePage( blankPageList.get( k ) - 1 );
}
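
Removing several pages by index needs some care, because every removal shifts the indices of the pages behind it. The effect can be checked without PDFBox on a plain list. In this sketch the class `RemoveByIndex` is a hypothetical helper and the list of strings merely stands in for the pages of a document; the indices are 0-based.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RemoveByIndex {
    /** Removes the elements at the given 0-based indices, highest index first. */
    public static <T> void removeAll(List<T> pages, List<Integer> zeroBasedIndexes) {
        List<Integer> sorted = new ArrayList<>(zeroBasedIndexes);
        sorted.sort(Collections.reverseOrder());
        for (int index : sorted) {
            // index is a primitive int here, so this calls remove(int index), not remove(Object)
            pages.remove(index);
        }
    }
}
```

Going from the highest index down means the lower indices still refer to the same pages when their turn comes.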

http://www.rgagnon.com/javadetails/java-detect-and-remove-blank-page-in-pdf.html
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
public class RemoveBlankPageFromPDF {
// value where we can consider that this is a blank image
// can be much higher or lower depending of what is considered as a blank page
public static final int BLANK_THRESHOLD = 160;
public static void removeBlankPdfPages(String source, String destination)
throws IOException, DocumentException
{
PdfReader r = null;
RandomAccessSourceFactory rasf = null;
RandomAccessFileOrArray raf = null;
Document document = null;
PdfCopy writer = null;
try {
r = new PdfReader(source);
// deprecated
// RandomAccessFileOrArray raf
// = new RandomAccessFileOrArray(pdfSourceFile);
// itext 5.4.1
rasf = new RandomAccessSourceFactory();
raf = new RandomAccessFileOrArray(rasf.createBestSource(source));
document = new Document(r.getPageSizeWithRotation(1));
writer = new PdfCopy(document, new FileOutputStream(destination));
document.open();
PdfImportedPage page = null;
for (int i=1; i<=r.getNumberOfPages(); i++) {
// first check, examine the resource dictionary for /Font or
// /XObject keys. If either are present -> not blank.
PdfDictionary pageDict = r.getPageN(i);
PdfDictionary resDict = (PdfDictionary) pageDict.get( PdfName.RESOURCES );
boolean noFontsOrImages = true;
if (resDict != null) {
noFontsOrImages = resDict.get( PdfName.FONT ) == null &&
resDict.get( PdfName.XOBJECT ) == null;
}
System.out.println(i + " noFontsOrImages " + noFontsOrImages);
if (!noFontsOrImages) {
byte bContent [] = r.getPageContent(i,raf);
ByteArrayOutputStream bs = new ByteArrayOutputStream();
bs.write(bContent);
System.out.println
(i + " " + bs.size() + " > BLANK_THRESHOLD " + (bs.size() > BLANK_THRESHOLD));
if (bs.size() > BLANK_THRESHOLD) {
page = writer.getImportedPage(r, i);
writer.addPage(page);
}
}
}
}
finally {
if (document != null) document.close();
if (writer != null) writer.close();
if (raf != null) raf.close();
if (r != null) r.close();
}
}
public static void main (String ... args) throws Exception {
removeBlankPdfPages
("C://temp//documentwithblank.pdf", "C://temp//documentwithnoblank.pdf");
}
}

How to read bookmarks in PDF using itext at multi level?

I am using iText (Java) to split PDFs at bookmark level.
Does anybody know of, or have examples for, splitting a PDF at bookmarks that exist at level 2 or 3?
For example, I have bookmarks at the following levels:
Father
|-Son
|-Son
|-Daughter
|-|-Grand son
|-|-Grand daughter
Right now I have the code below, which reads only the top-level bookmarks (Father); the SimpleBookmark.getBookmark(reader) line does all the work.
But I want to read the level 2 and level 3 bookmarks as well, so that I can split the content between those inner bookmarks.
public static void splitPDFByBookmarks(String pdf, String outputFolder){
try
{
PdfReader reader = new PdfReader(pdf);
//List of bookmarks: each bookmark is a map with values for title, page, etc
List<HashMap> bookmarks = SimpleBookmark.getBookmark(reader);
for(int i=0; i<bookmarks.size(); i++){
HashMap bm = bookmarks.get(i);
HashMap nextBM = i==bookmarks.size()-1 ? null : bookmarks.get(i+1);
//In my case I needed to split the title string
String title = ((String)bm.get("Title")).split(" ")[2];
log.debug("Titel: " + title);
String startPage = ((String)bm.get("Page")).split(" ")[0];
String startPageNextBM = nextBM==null ? "" + (reader.getNumberOfPages() + 1) : ((String)nextBM.get("Page")).split(" ")[0];
log.debug("Page: " + startPage);
log.debug("------------------");
extractBookmarkToPDF(reader, Integer.valueOf(startPage), Integer.valueOf(startPageNextBM), title + ".pdf",outputFolder);
}
}
catch (IOException e)
{
log.error(e.getMessage());
}
}
private static void extractBookmarkToPDF(PdfReader reader, int pageFrom, int pageTo, String outputName, String outputFolder){
Document document = new Document();
OutputStream os = null;
try{
os = new FileOutputStream(outputFolder + outputName);
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, os);
document.open();
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF data
PdfImportedPage page;
while(pageFrom < pageTo) {
document.newPage();
page = writer.getImportedPage(reader, pageFrom);
cb.addTemplate(page, 0, 0);
pageFrom++;
}
os.flush();
document.close();
os.close();
}catch(Exception ex){
log.error(ex.getMessage());
}finally {
if (document.isOpen())
document.close();
try {
if (os != null)
os.close();
} catch (IOException ioe) {
log.error(ioe.getMessage());
}
}
}
Your help is much appreciated.
Thanks in advance! :)

You get an ArrayList<HashMap> when you call SimpleBookmark.getBookmark(reader) (cast if you need to). Try iterating through that ArrayList to see its structure. If a bookmark has children (sons, as you call them), its map will contain another list with the same structure.
A recursive method could be the solution.
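
As a rough illustration of such a recursive method, the sketch below walks a list of bookmark maps and records each entry with its nesting level. It uses only plain `java.util` types so it can be tried without iText; the class `BookmarkWalker` and its `Entry` type are hypothetical. It assumes the map layout produced by iText 5's SimpleBookmark: a "Title" string, a "Page" string whose first token is the page number (as in the question's code), and nested bookmarks stored as a list under the "Kids" key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BookmarkWalker {
    /** One bookmark with its nesting depth (0 = top level). */
    public static final class Entry {
        public final int level;
        public final String title;
        public final int page; // -1 when the bookmark has no "Page" value

        Entry(int level, String title, int page) {
            this.level = level;
            this.title = title;
            this.page = page;
        }
    }

    /** Flattens the bookmark tree in document order. */
    public static List<Entry> flatten(List<? extends Map<String, Object>> bookmarks) {
        List<Entry> out = new ArrayList<>();
        walk(bookmarks, 0, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void walk(List<? extends Map<String, Object>> bookmarks, int level, List<Entry> out) {
        if (bookmarks == null) {
            return;
        }
        for (Map<String, Object> bm : bookmarks) {
            Object pageValue = bm.get("Page");
            // "Page" looks like "5 Fit" or "5 XYZ 0 792 0"; the first token is the page number.
            int page = pageValue == null ? -1 : Integer.parseInt(pageValue.toString().split(" ")[0]);
            out.add(new Entry(level, (String) bm.get("Title"), page));
            Object kids = bm.get("Kids");
            if (kids instanceof List) {
                walk((List<Map<String, Object>>) kids, level + 1, out); // recurse into children
            }
        }
    }
}
```

The flattened list could then be filtered by `level` and fed into the same start-page / next-start-page logic that the question already uses for the top-level bookmarks.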

For reference, for those who are looking at this using iText 7:
public void walkOutlines(PdfOutline outline, Map<String, PdfObject> names, PdfDocument pdfDocument,List<String>titles,List<Integer>pageNum) { //----------loop traversing all paths
for (PdfOutline child : outline.getAllChildren()){
if(child.getDestination() != null) {
prepareIndexFile(child,names,pdfDocument,titles,pageNum);
}
}
}
//-----Getting pageNumbers from outlines
public void prepareIndexFile(PdfOutline outline, Map<String, PdfObject> names, PdfDocument pdfDocument,List<String>titles,List<Integer>pageNum) {
String title = outline.getTitle();
PdfDestination pdfDestination = outline.getDestination();
String pdfStr = ((PdfString)pdfDestination.getPdfObject()).toUnicodeString();
PdfArray array = (PdfArray) names.get(pdfStr);
PdfObject pdfObj = array != null ? array.get(0) : null;
Integer pageNumber = pdfDocument.getPageNumber((PdfDictionary)pdfObj);
titles.add(title);
pageNum.add(pageNumber);
if(outline.getAllChildren().size() > 0) {
for (PdfOutline child : outline.getAllChildren()){
prepareIndexFile(child,names,pdfDocument,titles,pageNum);
}
}
}
public boolean splitPdf(String inputFile, final String outputFolder) {
boolean splitSuccess = true;
PdfDocument pdfDoc = null;
try {
PdfReader pdfReaderNew = new PdfReader(inputFile);
pdfDoc = new PdfDocument(pdfReaderNew);
final List<String> titles = new ArrayList<String>();
List<Integer> pageNum = new ArrayList<Integer>();
PdfNameTree destsTree = pdfDoc.getCatalog().getNameTree(PdfName.Dests);
Map<String, PdfObject> names = destsTree.getNames();//--------------------------------------Core logic for getting names
PdfOutline root = pdfDoc.getOutlines(false);//--------------------------------------Core logic for getting outlines
walkOutlines(root,names, pdfDoc, titles, pageNum); //------Logic to get bookmarks and pageNumbers
if (titles == null || titles.size()==0) {
splitSuccess = false;
}else { //------Proceed if it has bookmarks
for(int i=0;i<titles.size();i++) {
String title = titles.get(i);
String startPageNmStr =""+pageNum.get(i);
int startPage = Integer.parseInt(startPageNmStr);
int endPage = startPage;
if(i == titles.size() - 1) {
endPage = pdfDoc.getNumberOfPages();
}else {
int nextPage = pageNum.get(i+1);
if(nextPage > startPage) {
endPage = nextPage - 1;
}else {
endPage = nextPage;
}
}
String outFileName = outputFolder + File.separator + getFileName(title) + ".pdf";
PdfWriter pdfWriter = new PdfWriter(outFileName);
PdfDocument newDocument = new PdfDocument(pdfWriter, new DocumentProperties().setEventCountingMetaInfo(null));
pdfDoc.copyPagesTo(startPage, endPage, newDocument);
newDocument.close();
pdfWriter.close();
}
}
}catch(Exception e){
//---log
splitSuccess = false;
}
return splitSuccess;
}

How to combine multiple multi-page tif files into a single tif

I am trying to take multiple multi-page .tif files and combine them into a single multi-page tif file.
I found some code in this question, but it only seems to take the first page of each individual .tif file and create the new multi-page .tif with those first pages.
Is there a small change I'm not seeing that would cause this same code to grab every page from the source .tif files and put them all into the combined .tif?
To clarify, I would like the source files:
SourceA.tif (3 pages)
SourceB.tif (4 pages)
SourceC.tif (1 page)
to be combined into
combined.tif (8 pages)
I would also like to be able to specify a resolution and compression of the .tif, but I'm not sure if JAI supports that and it's not a necessity for a correct answer.
The code from the referenced question, modified by me to load all the .tif files in a directory, is below for easy answering:
public static void main(String[] args) {
String inputDir = "C:\\tifSources";
File sourceDirectory = new File(inputDir);
File file[] = sourceDirectory.listFiles();
int numImages = file.length;
BufferedImage image[] = new BufferedImage[numImages];
try
{
for (int i = 0; i < numImages; i++)
{
SeekableStream ss = new FileSeekableStream(file[i]);
ImageDecoder decoder = ImageCodec.createImageDecoder("tiff", ss, null);
PlanarImage op = new NullOpImage(decoder.decodeAsRenderedImage(0), null, null, OpImage.OP_IO_BOUND);
image[i] = op.getAsBufferedImage();
}
TIFFEncodeParam params = new TIFFEncodeParam();
OutputStream out = new FileOutputStream(inputDir + "\\combined.tif");
ImageEncoder encoder = ImageCodec.createImageEncoder("tiff", out, params);
List<BufferedImage> imageList = new ArrayList<BufferedImage>();
for (int i = 0; i < numImages; i++)
{
imageList.add(image[i]);
}
params.setExtraImages(imageList.iterator());
encoder.encode(image[0]);
out.close();
}
catch (Exception e)
{
System.out.println("Exception " + e);
}
}

I knew I was just missing some little part about iterating over the pages in a single .tif, I just wasn't sure where it was.
More searching on the internet led me to find that rather than doing:
PlanarImage op = new NullOpImage(decoder.decodeAsRenderedImage(0), null, null, OpImage.OP_IO_BOUND);
I wanted to iterate over every page in the current document with something like:
int numPages = decoder.getNumPages();
for(int j = 0; j < numPages; j++)
{
PlanarImage op = new NullOpImage(decoder.decodeAsRenderedImage(j), null, null, OpImage.OP_IO_BOUND);
images.add(op.getAsBufferedImage());
}
This adds every page of every .tif into the images List. One final trap was that the final call to
encoder.encode(images.get(0));
would cause the first page to be in the new .tif twice, so I added an intermediate loop that fills a second List without the first page and passed that List to:
params.setExtraImages(imageList.iterator());
which keeps the first page out of the "ExtraImages" and it gets added with the call to encode.
Final updated code is:
public static void main(String[] args) {
String inputDir = "C:\\tifSources";
File faxSource = new File(inputDir);
File file[] = faxSource.listFiles();
System.out.println("files are " + Arrays.toString(file));
int numImages = file.length;
List<BufferedImage> images = new ArrayList<BufferedImage>();
try
{
for (int i = 0; i < numImages; i++)
{
SeekableStream ss = new FileSeekableStream(file[i]);
ImageDecoder decoder = ImageCodec.createImageDecoder("tiff", ss, null);
int numPages = decoder.getNumPages();
for(int j = 0; j < numPages; j++)
{
PlanarImage op = new NullOpImage(decoder.decodeAsRenderedImage(j), null, null, OpImage.OP_IO_BOUND);
images.add(op.getAsBufferedImage());
}
}
TIFFEncodeParam params = new TIFFEncodeParam();
OutputStream out = new FileOutputStream(inputDir + "\\combined.tif");
ImageEncoder encoder = ImageCodec.createImageEncoder("tiff", out, params);
List<BufferedImage> imageList = new ArrayList<BufferedImage>();
for (int i = 1; i < images.size(); i++)
{
imageList.add(images.get(i));
}
params.setExtraImages(imageList.iterator());
encoder.encode(images.get(0));
out.close();
}
catch (Exception e)
{
System.out.println("Exception " + e);
}
}
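
On Java 9 and later, the standard `javax.imageio` API ships with a TIFF plugin, so the same merge can also be written without JAI. The following is a sketch of that alternative, not the approach used in the answer above: the class `TiffCombiner` is a hypothetical name, and `compressionType` is passed straight to `ImageWriteParam.setCompressionType` (for example "LZW"), or may be null to keep the writer's default. Resolution is not handled here; it would have to be set through image metadata.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;

public class TiffCombiner {
    /** Appends every page of every source TIFF to one multi-page TIFF; returns the number of pages written. */
    public static int combine(List<File> sources, File target, String compressionType) throws IOException {
        // A file-backed ImageOutputStream does not truncate, so start from a fresh file.
        Files.deleteIfExists(target.toPath());
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        if (compressionType != null) {
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType(compressionType);
        }
        int pagesWritten = 0;
        try (ImageOutputStream out = ImageIO.createImageOutputStream(target)) {
            writer.setOutput(out);
            writer.prepareWriteSequence(null);
            for (File source : sources) {
                try (ImageInputStream in = ImageIO.createImageInputStream(source)) {
                    ImageReader reader = ImageIO.getImageReaders(in).next();
                    try {
                        reader.setInput(in);
                        int numPages = reader.getNumImages(true); // scans the file to count pages
                        for (int pageIndex = 0; pageIndex < numPages; pageIndex++) {
                            BufferedImage page = reader.read(pageIndex);
                            writer.writeToSequence(new IIOImage(page, null, null), param);
                            pagesWritten++;
                        }
                    } finally {
                        reader.dispose();
                    }
                }
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
        return pagesWritten;
    }
}
```

Because pages are written one at a time as a sequence, there is no separate "first image" and "extra images" split to keep in sync, and only one page needs to be held in memory at a time.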
