delete am image from a PDF file using PDFbox

delete am image from a PDF file using PDFbox - java

I am attempting to delete images from a PDF using java and PDFbox. The images are not inline, and the PDF does not have patterns or forms. The pdf file contains 2 images. The PDFdebugger tool shows Resources >> XObject >> IM3 and IM5. The problem is: I display the output pdf file and the images are not deleted.
public class DeleteImage {
public static void removeImages(String pdfFile) throws Exception {
PDDocument document = PDDocument.load(new File(pdfFile));
for (PDPage page : document.getPages()) {
PDResources pdResources = page.getResources();
pdResources.getXObjectNames().forEach(propertyName -> {
if(!pdResources.isImageXObject(propertyName)) {
return;
}
PDXObject o;
try {
o = pdResources.getXObject(propertyName);
if (o instanceof PDImageXObject) {
System.out.println("propertyName" + propertyName);
page.getCOSObject().removeItem(propertyName);
}
} catch (IOException e) {
e.printStackTrace();
}
});
for (COSName name : page.getResources().getPatternNames()) {
PDAbstractPattern pattern = page.getResources().getPattern(name);
System.out.println("have pattern");
}
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<Object> tokens = parser.getTokens();
System.out.println("original tokens size" + tokens.size());
List<Object> newTokens = new ArrayList<Object>();
for(int j=0; j<tokens.size(); j++) {
Object token = tokens.get( j );
if( token instanceof Operator ) {
Operator op = (Operator)token;
System.out.println("operation" + op.getName());
//find image - remove it
if( op.getName().equals("Do") ) {
System.out.println("op equals Do");
newTokens.remove(newTokens.size()-1);
continue;
} else if ("BI".equals(op.getName())) {
System.out.println("inline -- op equals BI");
} else {
System.out.println("op not quals Do");
}
}
newTokens.add(token);
}
PDDocument newDoc = new PDDocument();
PDPage newPage = newDoc.importPage(page);
newPage.setResources(page.getResources());
System.out.println("tokens size" + newTokens.size());
PDStream newContents = new PDStream(newDoc);
OutputStream out = newContents.createOutputStream();
ContentStreamWriter writer = new ContentStreamWriter( out );
writer.writeTokens( newTokens);
out.close();
newPage.setContents( newContents );
}
document.save("RemoveImage.pdf");
document.close();
}
public static void remove(String pdfFile) throws Exception {
PDDocument document = PDDocument.load(new File(pdfFile));
PDResources resources = null;
for (PDPage page : document.getPages()) {
resources = page.getResources();
for (COSName name : resources.getXObjectNames()) {
PDXObject xobject = resources.getXObject(name);
if (xobject instanceof PDImageXObject) {
System.out.println("have image");
removeImages(pdfFile);
}
}
}
document.save("RemoveImage.pdf");
document.close();
}
}

If You Call remove...
In remove you
load the PDF into document,
iterate over the pages of document, and for each page
iterate over the XObject resources, and for each Xobject
check whether it is an image Xobject, and if it is
call removeImages which loads the same original file, processes it, and saves the result as "RemoveImage.pdf".
After all that processing you save the unchanged document to "RemoveImage.pdf".
So in that last step you overwrite any changes you may have done in removeImages and end up with your original file in "RemoveImage.pdf"!
If You Call removeImages Directly...
In removeImages you do some changes but there are certain issues:
Whenever you find an image Xobject resource, you attempt to remove it from the page directly
page.getCOSObject().removeItem(propertyName);
but the image Xobject resource is not a direct child of the page, it is managed by pdResources, so you should remove it from there.
You remove all Do instructions from the page content, not only those for image Xobjects, so you probably remove more than you wanted.

Related

Removing white space when an image is removed using apache PDFBox

I was trying to remove images from a PDF using Apache’s PDFBox with the following code:
//This will clear unwanted images but leaves blank space.
private ByteArrayOutputStream clearimages(ByteArrayOutputStream destinationStream) throws IOException {
PDDocument document = PDDocument.load(destinationStream.toByteArray());
PDDocumentCatalog catalog = document.getDocumentCatalog();
AtomicInteger keepImage = new AtomicInteger(0);
for (Object pageObj : catalog.getPages()) {
PDPage page = (PDPage) pageObj;
PDResources pdResources = page.getResources();
pdResources.getXObjectNames().forEach(propertyName -> {
if(!pdResources.isImageXObject(propertyName)) {
return;
}
PDXObject o;
try {
keepImage.incrementAndGet();
o = pdResources.getXObject(propertyName);
//we want to retain the first 4 images, others have to be
removed.
if (o instanceof PDImageXObject && keepImage.get() > 4) {
pdResources.getCOSObject().getCOSDictionary
(COSName.XOBJECT).removeItem(propertyName);
}
} catch (IOException e) {
e.printStackTrace();
}
});
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
document.save(out);
document.close();
return out;
}
I was able to remove the image but could not remove the empty white space it leaves behind.
See image for the problem:
How do I get rid of the white space the image occupied?

PDF font embedding not working using PDFBox

I am working on embedding fonts those are not embedded to PDF. For this, I am using PDFBox library to identify missing fonts using PDFont class. Using this I am able to identify list of missing fonts. But when I am trying to embed them using my local machine font's(grabbed TTF files from my local machine fonts folder), I am not able to do it, getting following result.
I am using following code to get list of (not embedded) fonts,
private static List<FontsDetails> checkAllFontsEmbeddedOrNot(PDDocument pdDocument) throws Exception {
List<FontsDetails> notEmbFonts = null;
try {
if(null != pdDocument){
PDPageTree pageTree = pdDocument.getDocumentCatalog().getPages();
notEmbFonts = new ArrayList<>();
for (PDPage pdPage : pageTree) {
PDResources resources = pdPage.getResources();
Iterable<COSName> cosNameIte = resources.getFontNames();
Iterator<COSName> names = cosNameIte.iterator();
while (names.hasNext()) {
COSName name = names.next();
PDFont font = resources.getFont(name);
boolean isEmbedded = font.isEmbedded();
if(!isEmbedded){
FontsDetails fontsDetails = new FontsDetails();
fontsDetails.setFontName(font.getName().toString());
fontsDetails.setFontSubType(font.getSubType());
notEmbFonts.add(fontsDetails);
}
}
}
}
} catch (Exception exception) {
logger.error("Exception occurred while validating fonts : ", exception);
throw new PDFUtilsException("Exception occurred while validating fonts : ",exception);
}
return notEmbFonts;
}
Following is the code which I am using to embed fonts which I am getting from above list,
public List<FontsDetails> embedFontToPdf(File pdf, FontsDetails fontToEmbed) {
ArrayList<FontsDetails> notSupportedFonts = new ArrayList<>();
try (PDDocument pdDocument = PDDocument.load(pdf)) {
LOGGER.info("Embedding font : " + fontToEmbed.getFontName());
InputStream ttfFileStream = PDFBoxOperationsUtility.class.getClassLoader()
.getResourceAsStream(fontToEmbed.getFontName() + ".ttf");//loading ttf file
if (null != ttfFileStream) {
PDFont font = PDType0Font.load(pdDocument, ttfFileStream);
PDPage pdfPage = new PDPage();
PDResources pdfResources = new PDResources();
pdfPage.setResources(pdfResources);
PDPageContentStream contentStream = new PDPageContentStream(pdDocument, pdfPage);
if (fontToEmbed.getFontSize() == 0) {
fontToEmbed.setFontSize(DEFAULT_FONT_SIZE);
}
font.encode("ANSI");
contentStream.setFont(font, fontToEmbed.getFontSize());
contentStream.close();
pdDocument.addPage(pdfPage);
pdDocument.save(pdf);
} else {
LOGGER.info("Font : " + fontToEmbed.getFontName() + " not supported");
notSupportedFonts.add(fontToEmbed);
}
} catch (Exception exception) {
notSupportedFonts.add(fontToEmbed);
LOGGER.error("Error ocurred while embedding font to pdf : " + pdf.getName(), exception);
}
return notSupportedFonts;
}
Could someone pleas help me to identify what mistake I am doing or any other approach I will need to take.

PDFBox merge 2 pdf files side by side with java

I compare 2 pdf files and mark highlight on them.
When i using pdfbox to merge it for comparison . It have error missing highlight.
I using this function:
The function to merge 2 file pdfs with all pages of them to side by side.
function void generateSideBySidePDF() {
File pdf1File = new File(FILE1_PATH);
File pdf2File = new File(FILE2_PATH);
File outPdfFile = new File(OUTFILE_PATH);
PDDocument pdf1 = null;
PDDocument pdf2 = null;
PDDocument outPdf = null;
try {
pdf1 = PDDocument.load(pdf1File);
pdf2 = PDDocument.load(pdf2File);
outPdf = new PDDocument();
for(int pageNum = 0; pageNum < pdf1.getNumberOfPages(); pageNum++) {
// Create output PDF frame
PDRectangle pdf1Frame = pdf1.getPage(pageNum).getCropBox();
PDRectangle pdf2Frame = pdf2.getPage(pageNum).getCropBox();
PDRectangle outPdfFrame = new PDRectangle(pdf1Frame.getWidth()+pdf2Frame.getWidth(), Math.max(pdf1Frame.getHeight(), pdf2Frame.getHeight()));
// Create output page with calculated frame and add it to the document
COSDictionary dict = new COSDictionary();
dict.setItem(COSName.TYPE, COSName.PAGE);
dict.setItem(COSName.MEDIA_BOX, outPdfFrame);
dict.setItem(COSName.CROP_BOX, outPdfFrame);
dict.setItem(COSName.ART_BOX, outPdfFrame);
PDPage outPdfPage = new PDPage(dict);
outPdf.addPage(outPdfPage);
// Source PDF pages has to be imported as form XObjects to be able to insert them at a specific point in the output page
LayerUtility layerUtility = new LayerUtility(outPdf);
PDFormXObject formPdf1 = layerUtility.importPageAsForm(pdf1, pageNum);
PDFormXObject formPdf2 = layerUtility.importPageAsForm(pdf2, pageNum);
// Add form objects to output page
AffineTransform afLeft = new AffineTransform();
layerUtility.appendFormAsLayer(outPdfPage, formPdf1, afLeft, "left" + pageNum);
AffineTransform afRight = AffineTransform.getTranslateInstance(pdf1Frame.getWidth(), 0.0);
layerUtility.appendFormAsLayer(outPdfPage, formPdf2, afRight, "right" + pageNum);
}
outPdf.save(outPdfFile);
outPdf.close();
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (pdf1 != null) pdf1.close();
if (pdf2 != null) pdf2.close();
if (outPdf != null) outPdf.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

Insert this into your code after the "Source PDF pages has to be imported" segment to copy the annotations. The ones of the right PDF must have their rectangle moved.
// copy annotations
PDPage src1Page = pdf1.getPage(pageNum);
PDPage src2Page = pdf2.getPage(pageNum);
for (PDAnnotation ann : src1Page.getAnnotations())
{
outPdfPage.getAnnotations().add(ann);
}
for (PDAnnotation ann : src2Page.getAnnotations())
{
PDRectangle rect = ann.getRectangle();
ann.setRectangle(new PDRectangle(rect.getLowerLeftX() + pdf1Frame.getWidth(), rect.getLowerLeftY(), rect.getWidth(), rect.getHeight()));
outPdfPage.getAnnotations().add(ann);
}
Note that this code has a flaw - it works only with annotations WITH appearance stream (most have it). It will have weird effects for those that don't, in that case, one would have to adjust the coordinates depending on the annotation type. For highlights, it would be the quadpoints, for line it would be the line coordinates, etc, etc.

How do you extract color profiles from a PDF file using pdfbox (or other open source Java lib)

Once you've loaded a document:
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
How do you get the page by page printing color intent from the PDDocument? I read the docs, didn't see coverage.

This gets the output intents (you'll get these with high quality PDF files) and also the icc profiles for colorspaces and images:
PDDocument doc = PDDocument.load(new File("XXXXX.pdf"));
for (PDOutputIntent oi : doc.getDocumentCatalog().getOutputIntents())
{
COSStream destOutputIntent = oi.getDestOutputIntent();
String info = oi.getOutputCondition();
if (info == null || info.isEmpty())
{
info = oi.getInfo();
}
InputStream is = destOutputIntent.createInputStream();
FileOutputStream fos = new FileOutputStream(info + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
for (int p = 0; p < doc.getNumberOfPages(); ++p)
{
PDPage page = doc.getPage(p);
for (COSName name : page.getResources().getColorSpaceNames())
{
PDColorSpace cs = page.getResources().getColorSpace(name);
if (cs instanceof PDICCBased)
{
PDICCBased iccCS = (PDICCBased) cs;
InputStream is = iccCS.getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
for (COSName name : page.getResources().getXObjectNames())
{
PDXObject x = page.getResources().getXObject(name);
if (x instanceof PDImageXObject)
{
PDImageXObject img = (PDImageXObject) x;
if (img.getColorSpace() instanceof PDICCBased)
{
InputStream is = ((PDICCBased) img.getColorSpace()).getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
}
}
doc.close();
What this doesn't do (but I could add some of it if needed):
colorspaces of shadings, patterns, xobject forms, appearance stream resources
recursion in colorspaces like DeviceN and Separation
recursion in patterns, xobject forms, soft masks

I read the examples on "How to create/add Intents to a PDF file". I couldn't get an example on "How to get intents". Using the API/examples, I wrote the following (untested code) to get the COSStream object for each of the Intents. See if this is useful for you.
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
for (PDOutputIntent e : list) {
p("PDOutputIntent Found:");
p("Info="+e.getInfo());
p("OutputCondition="+e.getOutputCondition());
p("OutputConditionIdentifier="+e.getOutputConditionIdentifier());
p("RegistryName="+e.getRegistryName());
COSStream cstr = e.getDestOutputIntent();
}
static void p(String s) {
System.out.println(s);
}
}

Using itext pdf library (fork of an older version 4.2.1) you could do smth. like:
PdfReader reader = new com.lowagie.text.pdf.PdfReader(Path pathToPdf);
PRStream stream = (PRStream) reader.getCatalog().getAsDict(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
For extracting the profile from each page you could iterate over each page like
for(int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
PRStream prStream = (PRStream) pdfReader.getPageN(i).getDirectObject(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
}

I don't know whether this code help or not, after searching below links,
How do I add an ICC to an existing PDF document
PdfBox - PDColorSpaceFactory.createColorSpace(document, iccColorSpace) throws nullpointerexception
https://pdfbox.apache.org/docs/1.8.11/javadocs/org/apache/pdfbox/pdmodel/graphics/color/PDICCBased.html
I found some code, check whether it help or not,
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
PDDocumentCatalog cat = doc.getDocumentCatalog();
COSArray cosArray = doc.getCOSObject();
PDICCBased pdCS = new PDICCBased( cosArray );
pdCS.getNumberOfComponents()
static void p(String s) {
System.out.println(s);
}
}

Replace a image with Apache POI

I need help by replacing an image with another image in Word using Apache POI or any other library that might do the job. I know how to replace a word using Apache POI but I can't figure a way out to replace an image.
public static void main(String[] args) throws FileNotFoundException {
String c22 = "OTHER WORD";
try {
XWPFDocument doc = new XWPFDocument(OPCPackage.open("imagine.docx"));
for (XWPFParagraph p : doc.getParagraphs()) {
List<XWPFRun> runs = p.getRuns();
if (runs != null) {
for (XWPFRun r : runs) {
String text = r.getText(0);
if (text != null ) {
String imgFile = "imaginedeschis.jpg";
try (FileInputStream is = new FileInputStream(imgFile)) {
r.addPicture(is, XWPFDocument.PICTURE_TYPE_JPEG, imgFile,
Units.toEMU(200), Units.toEMU(200)); // 200x200 pixels
text = text.replace("1ST WORD", c22);
} // 200x200 pixels
r.setText(text, 0);
}
}
}
}
doc.write(new FileOutputStream("output.docx"));
} catch (InvalidFormatException | IOException m){ }
}

I am using below Java code to replace one image in Word document (*.docx). Please share if anyone have better approach.
public XWPFDocument replaceImage(XWPFDocument document, String imageOldName, String imagePathNew, int newImageWidth, int newImageHeight) throws Exception {
try {
LOG.info("replaceImage: old=" + imageOldName + ", new=" + imagePathNew);
int imageParagraphPos = -1;
XWPFParagraph imageParagraph = null;
List<IBodyElement> documentElements = document.getBodyElements();
for(IBodyElement documentElement : documentElements){
imageParagraphPos ++;
if(documentElement instanceof XWPFParagraph){
imageParagraph = (XWPFParagraph) documentElement;
if(imageParagraph != null && imageParagraph.getCTP() != null && imageParagraph.getCTP().toString().trim().indexOf(imageOldName) != -1) {
break;
}
}
}
if (imageParagraph == null) {
throw new Exception("Unable to replace image data due to the exception:\n"
+ "'" + imageOldName + "' not found in in document.");
}
ParagraphAlignment oldImageAlignment = imageParagraph.getAlignment();
// remove old image
document.removeBodyElement(imageParagraphPos);
// now add new image
// BELOW LINE WILL CREATE AN IMAGE
// PARAGRAPH AT THE END OF THE DOCUMENT.
// REMOVE THIS IMAGE PARAGRAPH AFTER
// SETTING THE NEW IMAGE AT THE OLD IMAGE POSITION
XWPFParagraph newImageParagraph = document.createParagraph();
XWPFRun newImageRun = newImageParagraph.createRun();
//newImageRun.setText(newImageText);
newImageParagraph.setAlignment(oldImageAlignment);
try (FileInputStream is = new FileInputStream(imagePathNew)) {
newImageRun.addPicture(is, XWPFDocument.PICTURE_TYPE_JPEG, imagePathNew,
Units.toEMU(newImageWidth), Units.toEMU(newImageHeight));
}
// set new image at the old image position
document.setParagraph(newImageParagraph, imageParagraphPos);
// NOW REMOVE REDUNDANT IMAGE FORM THE END OF DOCUMENT
document.removeBodyElement(document.getBodyElements().size() - 1);
return document;
} catch (Exception e) {
throw new Exception("Unable to replace image '" + imageOldName + "' due to the exception:\n" + e);
} finally {
// cleanup code
}
}
Please visit https://bitbucket.org/wishcoder/java-poi-word-document/wiki/Home for more examples like:
Open existing Microsoft Word Document (*.docx)
Clone Table in Word Document and add new data to cloned table
Update existing Table->Cell data in document
Update existing Hyper Link in document
Replace existing Image in document
Save update Microsoft Word Document (*.docx)

I recommend using transparent tables to track images. following code will replace table row 0 col 1 cell's picture.
List<XWPFParagraph> paragraphs = table.getRow(0).getCell(1).getParagraphs();
for (XWPFParagraph para: paragraphs) {
for (XWPFRun r : para.getRuns()) {
CTR ctr = r.getCTR();
List<CTDrawing> drawings = ctr.getDrawingList();
for (int i = 0; i < drawings.size(); i++) {
ctr.removeDrawing(i);
}
}
}
XWPFParagraph paragraph = table.getRow(0).getCell(1).addParagraph();
XWPFRun run = paragraph.createRun();
FileInputStream fis = new FileInputStream('filepath');
run.addPicture(fis, XWPFDocument.PICTURE_TYPE_PNG, "filename", Units.toEMU(200), Units.toEMU(60));

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

delete am image from a PDF file using PDFbox - java

Related

Removing white space when an image is removed using apache PDFBox

PDF font embedding not working using PDFBox

PDFBox merge 2 pdf files side by side with java

How do you extract color profiles from a PDF file using pdfbox (or other open source Java lib)

Replace a image with Apache POI

Categories

Resources