I compare 2 pdf files and mark highlight on them.
When i using pdfbox to merge it for comparison . It have error missing highlight.
I using this function:
The function to merge 2 file pdfs with all pages of them to side by side.
function void generateSideBySidePDF() {
File pdf1File = new File(FILE1_PATH);
File pdf2File = new File(FILE2_PATH);
File outPdfFile = new File(OUTFILE_PATH);
PDDocument pdf1 = null;
PDDocument pdf2 = null;
PDDocument outPdf = null;
try {
pdf1 = PDDocument.load(pdf1File);
pdf2 = PDDocument.load(pdf2File);
outPdf = new PDDocument();
for(int pageNum = 0; pageNum < pdf1.getNumberOfPages(); pageNum++) {
// Create output PDF frame
PDRectangle pdf1Frame = pdf1.getPage(pageNum).getCropBox();
PDRectangle pdf2Frame = pdf2.getPage(pageNum).getCropBox();
PDRectangle outPdfFrame = new PDRectangle(pdf1Frame.getWidth()+pdf2Frame.getWidth(), Math.max(pdf1Frame.getHeight(), pdf2Frame.getHeight()));
// Create output page with calculated frame and add it to the document
COSDictionary dict = new COSDictionary();
dict.setItem(COSName.TYPE, COSName.PAGE);
dict.setItem(COSName.MEDIA_BOX, outPdfFrame);
dict.setItem(COSName.CROP_BOX, outPdfFrame);
dict.setItem(COSName.ART_BOX, outPdfFrame);
PDPage outPdfPage = new PDPage(dict);
outPdf.addPage(outPdfPage);
// Source PDF pages has to be imported as form XObjects to be able to insert them at a specific point in the output page
LayerUtility layerUtility = new LayerUtility(outPdf);
PDFormXObject formPdf1 = layerUtility.importPageAsForm(pdf1, pageNum);
PDFormXObject formPdf2 = layerUtility.importPageAsForm(pdf2, pageNum);
// Add form objects to output page
AffineTransform afLeft = new AffineTransform();
layerUtility.appendFormAsLayer(outPdfPage, formPdf1, afLeft, "left" + pageNum);
AffineTransform afRight = AffineTransform.getTranslateInstance(pdf1Frame.getWidth(), 0.0);
layerUtility.appendFormAsLayer(outPdfPage, formPdf2, afRight, "right" + pageNum);
}
outPdf.save(outPdfFile);
outPdf.close();
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (pdf1 != null) pdf1.close();
if (pdf2 != null) pdf2.close();
if (outPdf != null) outPdf.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Insert this into your code after the "Source PDF pages has to be imported" segment to copy the annotations. The ones of the right PDF must have their rectangle moved.
// copy annotations
PDPage src1Page = pdf1.getPage(pageNum);
PDPage src2Page = pdf2.getPage(pageNum);
for (PDAnnotation ann : src1Page.getAnnotations())
{
outPdfPage.getAnnotations().add(ann);
}
for (PDAnnotation ann : src2Page.getAnnotations())
{
PDRectangle rect = ann.getRectangle();
ann.setRectangle(new PDRectangle(rect.getLowerLeftX() + pdf1Frame.getWidth(), rect.getLowerLeftY(), rect.getWidth(), rect.getHeight()));
outPdfPage.getAnnotations().add(ann);
}
Note that this code has a flaw - it works only with annotations WITH appearance stream (most have it). It will have weird effects for those that don't, in that case, one would have to adjust the coordinates depending on the annotation type. For highlights, it would be the quadpoints, for line it would be the line coordinates, etc, etc.
Related
I was trying to remove images from a PDF using Apache’s PDFBox with the following code:
//This will clear unwanted images but leaves blank space.
private ByteArrayOutputStream clearimages(ByteArrayOutputStream destinationStream) throws IOException {
PDDocument document = PDDocument.load(destinationStream.toByteArray());
PDDocumentCatalog catalog = document.getDocumentCatalog();
AtomicInteger keepImage = new AtomicInteger(0);
for (Object pageObj : catalog.getPages()) {
PDPage page = (PDPage) pageObj;
PDResources pdResources = page.getResources();
pdResources.getXObjectNames().forEach(propertyName -> {
if(!pdResources.isImageXObject(propertyName)) {
return;
}
PDXObject o;
try {
keepImage.incrementAndGet();
o = pdResources.getXObject(propertyName);
//we want to retain the first 4 images, others have to be
removed.
if (o instanceof PDImageXObject && keepImage.get() > 4) {
pdResources.getCOSObject().getCOSDictionary
(COSName.XOBJECT).removeItem(propertyName);
}
} catch (IOException e) {
e.printStackTrace();
}
});
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
document.save(out);
document.close();
return out;
}
I was able to remove the image but could not remove the empty white space it leaves behind.
See image for the problem:
How do I get rid of the white space the image occupied?
I am having html content store as a raw string in my database and I like to print it in pdf, but with custom size, for example page size to be 10cm width and 7 com height, not standard A4 format.
Can someone gives me some examples if it is possible.
ByteArrayOutputStream out = new ByteArrayOutputStream();
PDRectangle rec = new PDRectangle(recWidth, recHeight);
PDPage page = new PDPage(rec);
try (PDDocument document = new PDDocument()) {
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.defaultTextDirection(BaseRendererBuilder.TextDirection.LTR);
String htmlContent = "<b>Hello world</b>" + content;
builder.withHtmlContent(htmlContent, "");
document.addPage(page);
builder.usePDDocument(document);
PdfBoxRenderer renderer = builder.buildPdfRenderer();
renderer.createPDFWithoutClosing();
document.save(out);
} catch (Exception e) {
ex.printStackTrace();
}
return new ByteArrayInputStream(out.toByteArray());
This code generates for me 2 files, one small and one A4.
UPDATE:
I tried this one:
try (PDDocument document = new PDDocument()) {
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.defaultTextDirection(BaseRendererBuilder.TextDirection.LTR);
builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
String htmlContent = "<b>content</b>";
builder.withHtmlContent(htmlContent, "");
builder.usePDDocument(document);
PdfBoxRenderer renderer = builder.buildPdfRenderer();
renderer.createPDFWithoutClosing();
document.save(out);
} catch (Exception e) {
log.error(">>> The creation of PDF is invalid!");
}
But in this case content is not shown, if I remove useDefaultPageSize, content will be shown
I didn't check this solution before, but try initialise the builder object with your desired page size and document type like below
builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
builder.usePdfAConformance(PdfRendererBuilder.PdfAConformance.PDFA_3_A);
the lib include many PDF format next is PdfAConformance Enum with possible values
PdfAConformance Enum
I am attempting to delete images from a PDF using java and PDFbox. The images are not inline, and the PDF does not have patterns or forms. The pdf file contains 2 images. The PDFdebugger tool shows Resources >> XObject >> IM3 and IM5. The problem is: I display the output pdf file and the images are not deleted.
public class DeleteImage {
public static void removeImages(String pdfFile) throws Exception {
PDDocument document = PDDocument.load(new File(pdfFile));
for (PDPage page : document.getPages()) {
PDResources pdResources = page.getResources();
pdResources.getXObjectNames().forEach(propertyName -> {
if(!pdResources.isImageXObject(propertyName)) {
return;
}
PDXObject o;
try {
o = pdResources.getXObject(propertyName);
if (o instanceof PDImageXObject) {
System.out.println("propertyName" + propertyName);
page.getCOSObject().removeItem(propertyName);
}
} catch (IOException e) {
e.printStackTrace();
}
});
for (COSName name : page.getResources().getPatternNames()) {
PDAbstractPattern pattern = page.getResources().getPattern(name);
System.out.println("have pattern");
}
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<Object> tokens = parser.getTokens();
System.out.println("original tokens size" + tokens.size());
List<Object> newTokens = new ArrayList<Object>();
for(int j=0; j<tokens.size(); j++) {
Object token = tokens.get( j );
if( token instanceof Operator ) {
Operator op = (Operator)token;
System.out.println("operation" + op.getName());
//find image - remove it
if( op.getName().equals("Do") ) {
System.out.println("op equals Do");
newTokens.remove(newTokens.size()-1);
continue;
} else if ("BI".equals(op.getName())) {
System.out.println("inline -- op equals BI");
} else {
System.out.println("op not quals Do");
}
}
newTokens.add(token);
}
PDDocument newDoc = new PDDocument();
PDPage newPage = newDoc.importPage(page);
newPage.setResources(page.getResources());
System.out.println("tokens size" + newTokens.size());
PDStream newContents = new PDStream(newDoc);
OutputStream out = newContents.createOutputStream();
ContentStreamWriter writer = new ContentStreamWriter( out );
writer.writeTokens( newTokens);
out.close();
newPage.setContents( newContents );
}
document.save("RemoveImage.pdf");
document.close();
}
public static void remove(String pdfFile) throws Exception {
PDDocument document = PDDocument.load(new File(pdfFile));
PDResources resources = null;
for (PDPage page : document.getPages()) {
resources = page.getResources();
for (COSName name : resources.getXObjectNames()) {
PDXObject xobject = resources.getXObject(name);
if (xobject instanceof PDImageXObject) {
System.out.println("have image");
removeImages(pdfFile);
}
}
}
document.save("RemoveImage.pdf");
document.close();
}
}
If You Call remove...
In remove you
load the PDF into document,
iterate over the pages of document, and for each page
iterate over the XObject resources, and for each Xobject
check whether it is an image Xobject, and if it is
call removeImages which loads the same original file, processes it, and saves the result as "RemoveImage.pdf".
After all that processing you save the unchanged document to "RemoveImage.pdf".
So in that last step you overwrite any changes you may have done in removeImages and end up with your original file in "RemoveImage.pdf"!
If You Call removeImages Directly...
In removeImages you do some changes but there are certain issues:
Whenever you find an image Xobject resource, you attempt to remove it from the page directly
page.getCOSObject().removeItem(propertyName);
but the image Xobject resource is not a direct child of the page, it is managed by pdResources, so you should remove it from there.
You remove all Do instructions from the page content, not only those for image Xobjects, so you probably remove more than you wanted.
I need to convert scanned PDF to grayscale PDF. I found 2 solutions for that.
First one is to just use renderImage
private void convertToGray() throws IOException {
File pdfFile = new File(PATH);
try (PDDocument originalPdf = PDDocument.load(pdfFile);
PDDocument doc = new PDDocument()) {
LOGGER.info("Current heap after loading file: {}", Runtime.getRuntime().totalMemory());
PDFRenderer pdfRenderer = new PDFRenderer(originalPdf);
for (int pageNum = 0; pageNum < originalPdf.getNumberOfPages(); pageNum++) {
// PDImageXObject pdImage = LosslessFactory.createFromImage(doc, bufferedImage);
BufferedImage grayImage = pdfRenderer.renderImageWithDPI(pageNum, 300F, ImageType.GRAY);
PDImageXObject pdImage = JPEGFactory.createFromImage(doc, grayImage);
float pageWight = originalPdf.getPage(pageNum).getMediaBox().getWidth();
float pageHeight = originalPdf.getPage(pageNum).getMediaBox().getHeight();
PDPage page = new PDPage(new PDRectangle(pageWight, pageHeight));
doc.addPage(page);
try (PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
contentStream.drawImage(pdImage, 0F, 0F, pageWight, pageHeight);
}
}
doc.save(NEW_PATH);
}
}
But this leads to increase size of the file (because some PDFs has less DPI than 300.
Second one is to just replace existing image with gray analog
private void convertByImageToGray() throws IOException {
File pdfFile = new File(PATH);
try (PDDocument document = PDDocument.load(pdfFile)) {
List<COSObject> objects = document.getDocument().getObjectsByType(COSName.IMAGE);
for (COSObject object : objects) {
LOGGER.info("Class: {}; {}", object.getClass(), object.toString());
}
for (int pageNum = 0; pageNum < document.getNumberOfPages(); pageNum++) {
PDPage page = document.getPage(pageNum);
replaceImage(document, page);
}
document.save(NEW_PATH);
}
}
private void replaceImage(PDDocument document, PDPage page) throws IOException {
PDResources resources = page.getResources();
Iterable<COSName> xObjectNames = resources.getXObjectNames();
if (xObjectNames != null) {
for (COSName xObjectName : xObjectNames) {
PDXObject object = resources.getXObject(xObjectName);
if (object instanceof PDImageXObject) {
PDImageXObject img1 = (PDImageXObject) object;
BufferedImage bufferedImage1 = img1.getImage();
BufferedImage grayBufferedImage = convertBufferedImageToGray(bufferedImage1);
// PDImageXObject grayImage = JPEGFactory.createFromImage(document, grayBufferedImage);
PDImageXObject grayImage = LosslessFactory.createFromImage(document, grayBufferedImage);
resources.put(xObjectName, grayImage);
}
}
}
}
private static BufferedImage convertBufferedImageToGray(BufferedImage sourceImg) {
ColorSpace cs = ColorSpace.getInstance(ColorSpace.CS_GRAY);
ColorConvertOp op = new ColorConvertOp(sourceImg.getColorModel().getColorSpace(), cs, null);
op.filter(sourceImg, sourceImg);
return sourceImg;
}
But still some files increase in size like 3 times (even they were already grayscale; interesting that int this case JPEGFactory produces larger files than LosslessFactory). All images in grayscale PDF have the same size as original ones. And I don't understand why.
Maybe there is a better way to make grayscale PDF with predictable size (except ghostscript)?
UPDATE: I've just realized that the issue is with creating PDF from image. It does not compress as well.
For example, I have dummy 1-page scan file that is less than 1 Mb. But if I get image from it (directly copying via Acrobat Reader to Paint, or via code above) it size is ~8-10 Mb depending on the method. And if I create new PDF from this image it's barely compressed. Here is example code:
File pdfFile = new File(FULL_FILE);
try (PDDocument document = PDDocument.load(pdfFile)) {
PDPage page = new PDPage();
document.addPage(page);
PDImageXObject pdImage = PDImageXObject.createFromFile("example.png", document);
try (PDPageContentStream contents = new PDPageContentStream(document, page)) {
contents.drawImage(pdImage, 0F, 0F);
}
document.save(FULL_FILE_NEW);
}
Yes LosslessFactory produces smaller files compared to JPEGFactory
In the below link there are different methods to try and achieve the same goal. Overall the best quality gray scale image was the one from Option 6, however this was by no means the fastest (I myself used Option 4). Comparisons are also provided for you to choose
This link contains possible ways to convert color images to black. It helped me a lot.
Let me know if it works for you and approve my answer if it helped.
I have a pdf containing 2 blank images. I need to replace both the images with 2 separate images using PDFBox. The problem is, both the blank images appear to have the same resource. So, if I replace one, the other one is replaced with the same image as well.
I followed this example and tried overriding the processOperator() method and replaced the images based on the imageHeight. However, it still ends up replacing both the images with the same image. This is my code thus far:
protected void processOperator( PDFOperator operator, List arguments ) throws IOException
{
String operation = operator.getOperation();
if( INVOKE_OPERATOR.equals(operation) )
{
COSName objectName = (COSName)arguments.get( 0 );
Map<String, PDXObject> xobjects = getResources().getXObjects();
PDXObject xobject = (PDXObject)xobjects.get( objectName.getName() );
if( xobject instanceof PDXObjectImage )
{
PDXObjectImage blankImage = (PDXObjectImage)xobject;
int imageWidth = blankImage.getWidth();
int imageHeight = blankImage.getHeight();
System.out.println("Image width >>> "+imageWidth+" height >>>> "+imageHeight);
// Check if it is blank image 1 based on height
if(imageHeight < 480){
File logo = new File("abc.jpg");
BufferedImage bufferedImage = ImageIO.read(logo);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write( bufferedImage, "jpg", baos );
baos.flush();
byte[] logoImageInBytes = baos.toByteArray();
baos.close();
// label will be used to replace the blank image
label = logoImageInBytes;
}
BufferedImage img = ImageIO.read(new ByteArrayInputStream(label));
BufferedImage resizedImage = Scalr.resize(img, Scalr.Method.BALANCED, Scalr.Mode.FIT_EXACT, img.getWidth(), img.getHeight());
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write(resizedImage, "jpg", baos);
// Replace empty image in template with the image generated from shipping label byte array
PDXObjectImage validImage = new PDJpeg(doc, new ByteArrayInputStream(baos.toByteArray()));
blankImage.getCOSStream().replaceWithStream(validImage.getCOSStream());
}
Now, when I remove the if block which checks if (imageHeight < 480), it prints the imageHeight as 30 and 470 for the blank images. However, when I add the if block, it prints the imageHeight as 480 and 1500 and never goes inside the if block because of which both the blank images end up getting replaced by the same image.
What's going on here? I'm new to PDFBox, so I am unsure if my code is correct.
While first thinking about a generic way to actually replace the existing Image by the new Images, I agree with #TilmanHausherr that a more simple solution would be to simply add an extra content stream with two images in the size / position you need covering the existing Image.
This approach is easier to implement (even generically) and less error-prone than actual replacement.
In a generic solution we do not have the Image positions beforehand. To determine them, we can use this helper class (which essentially is a rip-off of the PDFBox example PrintImageLocations):
public class ImageLocator extends PDFStreamEngine
{
private static final String INVOKE_OPERATOR = "Do";
public ImageLocator() throws IOException
{
super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PDFTextStripper.properties", true));
}
public List<ImageLocation> getLocations()
{
return new ArrayList<ImageLocation>(locations);
}
protected void processOperator(PDFOperator operator, List<COSBase> arguments) throws IOException
{
String operation = operator.getOperation();
if (INVOKE_OPERATOR.equals(operation))
{
COSName objectName = (COSName) arguments.get(0);
Map<String, PDXObject> xobjects = getResources().getXObjects();
PDXObject xobject = (PDXObject) xobjects.get(objectName.getName());
if (xobject instanceof PDXObjectImage)
{
PDXObjectImage image = (PDXObjectImage) xobject;
PDPage page = getCurrentPage();
Matrix matrix = getGraphicsState().getCurrentTransformationMatrix();
locations.add(new ImageLocation(page, matrix, image));
}
else if (xobject instanceof PDXObjectForm)
{
// save the graphics state
getGraphicsStack().push((PDGraphicsState) getGraphicsState().clone());
PDPage page = getCurrentPage();
PDXObjectForm form = (PDXObjectForm) xobject;
COSStream invoke = (COSStream) form.getCOSObject();
PDResources pdResources = form.getResources();
if (pdResources == null)
{
pdResources = page.findResources();
}
// if there is an optional form matrix, we have to
// map the form space to the user space
Matrix matrix = form.getMatrix();
if (matrix != null)
{
Matrix xobjectCTM = matrix.multiply(getGraphicsState().getCurrentTransformationMatrix());
getGraphicsState().setCurrentTransformationMatrix(xobjectCTM);
}
processSubStream(page, pdResources, invoke);
// restore the graphics state
setGraphicsState((PDGraphicsState) getGraphicsStack().pop());
}
}
else
{
super.processOperator(operator, arguments);
}
}
public class ImageLocation
{
public ImageLocation(PDPage page, Matrix matrix, PDXObjectImage image)
{
this.page = page;
this.matrix = matrix;
this.image = image;
}
public PDPage getPage()
{
return page;
}
public Matrix getMatrix()
{
return matrix;
}
public PDXObjectImage getImage()
{
return image;
}
final PDPage page;
final Matrix matrix;
final PDXObjectImage image;
}
final List<ImageLocation> locations = new ArrayList<ImageLocation>();
}
(ImageLocator.java)
In contrast to the example class this helper stores the locations in a list instead of printing them.
We now can cover existing images using code like this:
try ( InputStream resource = getClass().getResourceAsStream("sample.pdf");
InputStream left = getClass().getResourceAsStream("left.png");
InputStream right = getClass().getResourceAsStream("right.png");
PDDocument document = PDDocument.load(resource) )
{
if (document.isEncrypted())
{
document.decrypt("");
}
PDJpeg leftImage = new PDJpeg(document, ImageIO.read(left));
PDJpeg rightImage = new PDJpeg(document, ImageIO.read(right));
// Locate images
ImageLocator locator = new ImageLocator();
List<?> allPages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++)
{
PDPage page = (PDPage) allPages.get(i);
locator.processStream(page, page.findResources(), page.getContents().getStream());
}
// cover images
for (ImageLocation location : locator.getLocations())
{
// Decide on a replacement
PDRectangle cropBox = location.getPage().findCropBox();
float center = (cropBox.getLowerLeftX() + cropBox.getUpperRightX()) / 2.0f;
PDJpeg image = location.getMatrix().getXPosition() < center ? leftImage : rightImage;
AffineTransform transform = location.getMatrix().createAffineTransform();
PDPageContentStream content = new PDPageContentStream(document, location.getPage(), true, false, true);
content.drawXObject(image, transform);
content.close();
}
document.save(new File(RESULT_FOLDER, "sample-changed.pdf"));
}
(OverwriteImage)
This sample covers all images on the left half of their respective page with left.png and all others with right.png.
I have no implementation or example, but I want to illustrate you a possible way to do what you want by the following steps:
Since you need 2 Images (lets tell them imageA and imageB) in the pdf instead of 1 (which is the blank one). You have to add both of them to the pdf.
save the file temporary - optional, it could work without rewriting the pdf
reopen the file - optional, if you don't need step 2, you also don't need this step
Then replace the blank image with imageA or imageB
Remove the blank image from the pdf
Save the pdf