PDF font embedding not working using PDFBox - java

I am working on embedding fonts those are not embedded to PDF. For this, I am using PDFBox library to identify missing fonts using PDFont class. Using this I am able to identify list of missing fonts. But when I am trying to embed them using my local machine font's(grabbed TTF files from my local machine fonts folder), I am not able to do it, getting following result.
I am using following code to get list of (not embedded) fonts,
private static List<FontsDetails> checkAllFontsEmbeddedOrNot(PDDocument pdDocument) throws Exception {
List<FontsDetails> notEmbFonts = null;
try {
if(null != pdDocument){
PDPageTree pageTree = pdDocument.getDocumentCatalog().getPages();
notEmbFonts = new ArrayList<>();
for (PDPage pdPage : pageTree) {
PDResources resources = pdPage.getResources();
Iterable<COSName> cosNameIte = resources.getFontNames();
Iterator<COSName> names = cosNameIte.iterator();
while (names.hasNext()) {
COSName name = names.next();
PDFont font = resources.getFont(name);
boolean isEmbedded = font.isEmbedded();
if(!isEmbedded){
FontsDetails fontsDetails = new FontsDetails();
fontsDetails.setFontName(font.getName().toString());
fontsDetails.setFontSubType(font.getSubType());
notEmbFonts.add(fontsDetails);
}
}
}
}
} catch (Exception exception) {
logger.error("Exception occurred while validating fonts : ", exception);
throw new PDFUtilsException("Exception occurred while validating fonts : ",exception);
}
return notEmbFonts;
}
Following is the code which I am using to embed fonts which I am getting from above list,
public List<FontsDetails> embedFontToPdf(File pdf, FontsDetails fontToEmbed) {
ArrayList<FontsDetails> notSupportedFonts = new ArrayList<>();
try (PDDocument pdDocument = PDDocument.load(pdf)) {
LOGGER.info("Embedding font : " + fontToEmbed.getFontName());
InputStream ttfFileStream = PDFBoxOperationsUtility.class.getClassLoader()
.getResourceAsStream(fontToEmbed.getFontName() + ".ttf");//loading ttf file
if (null != ttfFileStream) {
PDFont font = PDType0Font.load(pdDocument, ttfFileStream);
PDPage pdfPage = new PDPage();
PDResources pdfResources = new PDResources();
pdfPage.setResources(pdfResources);
PDPageContentStream contentStream = new PDPageContentStream(pdDocument, pdfPage);
if (fontToEmbed.getFontSize() == 0) {
fontToEmbed.setFontSize(DEFAULT_FONT_SIZE);
}
font.encode("ANSI");
contentStream.setFont(font, fontToEmbed.getFontSize());
contentStream.close();
pdDocument.addPage(pdfPage);
pdDocument.save(pdf);
} else {
LOGGER.info("Font : " + fontToEmbed.getFontName() + " not supported");
notSupportedFonts.add(fontToEmbed);
}
} catch (Exception exception) {
notSupportedFonts.add(fontToEmbed);
LOGGER.error("Error ocurred while embedding font to pdf : " + pdf.getName(), exception);
}
return notSupportedFonts;
}
Could someone pleas help me to identify what mistake I am doing or any other approach I will need to take.

Related

No glyph found after getting Text and Font from existing pdf

My goal is to transfer textual content from a PDF to a new PDF while preserving the formatting of the font. (e.g. Bold, Italic, underlined..).
I try to use the TextPosition List from the existing PDF and write a new PDF from it.
For this I get from the TextPosition List the Font and FontSize of the current entry and set them in a contentStream to write the upcoming text through contentStream.showText().
after 137 successful loops this error follows:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
TextPosition tpInfo = null;
String pdfFileInText = null;
int charIDindex = 0;
int pageIndex = 0;
try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
if (!pdfDocument.isEncrypted()) {
MyPdfTextStripper myStripper = new MyPdfTextStripper();
var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
createDirectory();
String newFileString = (srcErledigt + "Test.pdf");
File input = new File(newFileString);
input.createNewFile();
PDDocument document = new PDDocument();
// For Pages
for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
List<List<TextPosition>> pageList = pageIterator.next();
PDPage newPage = new PDPage();
document.addPage(newPage);
PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
contentStream.beginText();
pageIndex++;
// For Articles
for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
List<TextPosition> articleList = articleIterator.next();
// For Text
for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
tpCharID = charIDindex;
tpInfo = tpIterator.next();
System.out.println(tpCharID + ". charID: " + tpInfo);
PDFont tpFont = tpInfo.getFont();
float tpFontSize = tpInfo.getFontSize();
pdfFileInText = tpInfo.toString();
contentStream.setFont(tpFont, tpFontSize);
contentStream.newLineAtOffset(50, 700);
contentStream.showText(pdfFileInText);
charIDindex++;
}
}
contentStream.endText();
contentStream.close();
}
} else {
System.out.println("pdf Encrypted");
}
}
}
MyPdfTextStripper:
public class MyPdfTextStripper extends PDFTextStripper {
public MyPdfTextStripper() throws IOException {
super();
setSortByPosition(true);
}
#Override
public List<List<TextPosition>> getCharactersByArticle() {
return super.getCharactersByArticle();
}
// Add Pages to CharactersByArticle List
public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
final int maxPageNr = doc.getNumberOfPages();
List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
setStartPage(pageNr);
setEndPage(pageNr);
getText(doc);
byPageList.add(List.copyOf(getCharactersByArticle()));
}
return byPageList;
}
Additional Info:
There are seven fonts in my document, all of which are set as subsets.
I need to write the Text given with the corresponding Font given.
All glyphs that should be written already exist in the original document, where I get my TextPositionList from.
All fonts are subtype 1 or 0
There is no AcroForm defined
Thanks in advance
Edit 30.08.2022:
Fixed the Issue by manually replacing this particular Unicode with a placeholder for the String before trying to write it.
Now I ran into this open ToDo:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
#Override
public byte[] encode(int unicode)
{
// todo: we can use a known character collection CMap for a CIDFont
// and an Encoding for Type 1-equivalent
throw new UnsupportedOperationException();
}
Anyone got any suggestions or Workarounds for this?
Edit 01.09.2022
I tried to replace occurrences of that Font with an alternative Font from the source file, but this opens another problem where a COSStream is "randomly" closed, which results in the new document not being able to save the File after writing my text with a contentStream.
Using standard Fonts like PDType1Font.HELVETICA instead works though..

Removing white space when an image is removed using apache PDFBox

I was trying to remove images from a PDF using Apache’s PDFBox with the following code:
//This will clear unwanted images but leaves blank space.
private ByteArrayOutputStream clearimages(ByteArrayOutputStream destinationStream) throws IOException {
PDDocument document = PDDocument.load(destinationStream.toByteArray());
PDDocumentCatalog catalog = document.getDocumentCatalog();
AtomicInteger keepImage = new AtomicInteger(0);
for (Object pageObj : catalog.getPages()) {
PDPage page = (PDPage) pageObj;
PDResources pdResources = page.getResources();
pdResources.getXObjectNames().forEach(propertyName -> {
if(!pdResources.isImageXObject(propertyName)) {
return;
}
PDXObject o;
try {
keepImage.incrementAndGet();
o = pdResources.getXObject(propertyName);
//we want to retain the first 4 images, others have to be
removed.
if (o instanceof PDImageXObject && keepImage.get() > 4) {
pdResources.getCOSObject().getCOSDictionary
(COSName.XOBJECT).removeItem(propertyName);
}
} catch (IOException e) {
e.printStackTrace();
}
});
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
document.save(out);
document.close();
return out;
}
I was able to remove the image but could not remove the empty white space it leaves behind.
See image for the problem:
How do I get rid of the white space the image occupied?

delete am image from a PDF file using PDFbox

I am attempting to delete images from a PDF using java and PDFbox. The images are not inline, and the PDF does not have patterns or forms. The pdf file contains 2 images. The PDFdebugger tool shows Resources >> XObject >> IM3 and IM5. The problem is: I display the output pdf file and the images are not deleted.
public class DeleteImage {
public static void removeImages(String pdfFile) throws Exception {
PDDocument document = PDDocument.load(new File(pdfFile));
for (PDPage page : document.getPages()) {
PDResources pdResources = page.getResources();
pdResources.getXObjectNames().forEach(propertyName -> {
if(!pdResources.isImageXObject(propertyName)) {
return;
}
PDXObject o;
try {
o = pdResources.getXObject(propertyName);
if (o instanceof PDImageXObject) {
System.out.println("propertyName" + propertyName);
page.getCOSObject().removeItem(propertyName);
}
} catch (IOException e) {
e.printStackTrace();
}
});
for (COSName name : page.getResources().getPatternNames()) {
PDAbstractPattern pattern = page.getResources().getPattern(name);
System.out.println("have pattern");
}
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<Object> tokens = parser.getTokens();
System.out.println("original tokens size" + tokens.size());
List<Object> newTokens = new ArrayList<Object>();
for(int j=0; j<tokens.size(); j++) {
Object token = tokens.get( j );
if( token instanceof Operator ) {
Operator op = (Operator)token;
System.out.println("operation" + op.getName());
//find image - remove it
if( op.getName().equals("Do") ) {
System.out.println("op equals Do");
newTokens.remove(newTokens.size()-1);
continue;
} else if ("BI".equals(op.getName())) {
System.out.println("inline -- op equals BI");
} else {
System.out.println("op not quals Do");
}
}
newTokens.add(token);
}
PDDocument newDoc = new PDDocument();
PDPage newPage = newDoc.importPage(page);
newPage.setResources(page.getResources());
System.out.println("tokens size" + newTokens.size());
PDStream newContents = new PDStream(newDoc);
OutputStream out = newContents.createOutputStream();
ContentStreamWriter writer = new ContentStreamWriter( out );
writer.writeTokens( newTokens);
out.close();
newPage.setContents( newContents );
}
document.save("RemoveImage.pdf");
document.close();
}
public static void remove(String pdfFile) throws Exception {
PDDocument document = PDDocument.load(new File(pdfFile));
PDResources resources = null;
for (PDPage page : document.getPages()) {
resources = page.getResources();
for (COSName name : resources.getXObjectNames()) {
PDXObject xobject = resources.getXObject(name);
if (xobject instanceof PDImageXObject) {
System.out.println("have image");
removeImages(pdfFile);
}
}
}
document.save("RemoveImage.pdf");
document.close();
}
}
If You Call remove...
In remove you
load the PDF into document,
iterate over the pages of document, and for each page
iterate over the XObject resources, and for each Xobject
check whether it is an image Xobject, and if it is
call removeImages which loads the same original file, processes it, and saves the result as "RemoveImage.pdf".
After all that processing you save the unchanged document to "RemoveImage.pdf".
So in that last step you overwrite any changes you may have done in removeImages and end up with your original file in "RemoveImage.pdf"!
If You Call removeImages Directly...
In removeImages you do some changes but there are certain issues:
Whenever you find an image Xobject resource, you attempt to remove it from the page directly
page.getCOSObject().removeItem(propertyName);
but the image Xobject resource is not a direct child of the page, it is managed by pdResources, so you should remove it from there.
You remove all Do instructions from the page content, not only those for image Xobjects, so you probably remove more than you wanted.

PDFBox merge 2 pdf files side by side with java

I compare 2 pdf files and mark highlight on them.
When i using pdfbox to merge it for comparison . It have error missing highlight.
I using this function:
The function to merge 2 file pdfs with all pages of them to side by side.
function void generateSideBySidePDF() {
File pdf1File = new File(FILE1_PATH);
File pdf2File = new File(FILE2_PATH);
File outPdfFile = new File(OUTFILE_PATH);
PDDocument pdf1 = null;
PDDocument pdf2 = null;
PDDocument outPdf = null;
try {
pdf1 = PDDocument.load(pdf1File);
pdf2 = PDDocument.load(pdf2File);
outPdf = new PDDocument();
for(int pageNum = 0; pageNum < pdf1.getNumberOfPages(); pageNum++) {
// Create output PDF frame
PDRectangle pdf1Frame = pdf1.getPage(pageNum).getCropBox();
PDRectangle pdf2Frame = pdf2.getPage(pageNum).getCropBox();
PDRectangle outPdfFrame = new PDRectangle(pdf1Frame.getWidth()+pdf2Frame.getWidth(), Math.max(pdf1Frame.getHeight(), pdf2Frame.getHeight()));
// Create output page with calculated frame and add it to the document
COSDictionary dict = new COSDictionary();
dict.setItem(COSName.TYPE, COSName.PAGE);
dict.setItem(COSName.MEDIA_BOX, outPdfFrame);
dict.setItem(COSName.CROP_BOX, outPdfFrame);
dict.setItem(COSName.ART_BOX, outPdfFrame);
PDPage outPdfPage = new PDPage(dict);
outPdf.addPage(outPdfPage);
// Source PDF pages has to be imported as form XObjects to be able to insert them at a specific point in the output page
LayerUtility layerUtility = new LayerUtility(outPdf);
PDFormXObject formPdf1 = layerUtility.importPageAsForm(pdf1, pageNum);
PDFormXObject formPdf2 = layerUtility.importPageAsForm(pdf2, pageNum);
// Add form objects to output page
AffineTransform afLeft = new AffineTransform();
layerUtility.appendFormAsLayer(outPdfPage, formPdf1, afLeft, "left" + pageNum);
AffineTransform afRight = AffineTransform.getTranslateInstance(pdf1Frame.getWidth(), 0.0);
layerUtility.appendFormAsLayer(outPdfPage, formPdf2, afRight, "right" + pageNum);
}
outPdf.save(outPdfFile);
outPdf.close();
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (pdf1 != null) pdf1.close();
if (pdf2 != null) pdf2.close();
if (outPdf != null) outPdf.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Insert this into your code after the "Source PDF pages has to be imported" segment to copy the annotations. The ones of the right PDF must have their rectangle moved.
// copy annotations
PDPage src1Page = pdf1.getPage(pageNum);
PDPage src2Page = pdf2.getPage(pageNum);
for (PDAnnotation ann : src1Page.getAnnotations())
{
outPdfPage.getAnnotations().add(ann);
}
for (PDAnnotation ann : src2Page.getAnnotations())
{
PDRectangle rect = ann.getRectangle();
ann.setRectangle(new PDRectangle(rect.getLowerLeftX() + pdf1Frame.getWidth(), rect.getLowerLeftY(), rect.getWidth(), rect.getHeight()));
outPdfPage.getAnnotations().add(ann);
}
Note that this code has a flaw - it works only with annotations WITH appearance stream (most have it). It will have weird effects for those that don't, in that case, one would have to adjust the coordinates depending on the annotation type. For highlights, it would be the quadpoints, for line it would be the line coordinates, etc, etc.

How to get image of a PDF page(included text). not images in a PDF page

I have tried a PDF page to image, But just extracted each images in the PDF page. not page image.
Below Code :
public class ExtractionPDFtoThumbImgs {
static String filePath = "/Users/tmdtjq/Downloads/PDFTest/test.pdf";
static String outputFilePath = "/Users/tmdtjq/Downloads/PDFTest/pageimages";
public static void change(File inputFile, File outputFolder) throws IOException {
//TODO check the input file exists and is PDF
//TODO for the treatment of PDF encrypted
PDDocument doc = null;
try {
doc = PDDocument.load(inputFile);
List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i <allPages.size(); i++) {
PDPage page = allPages.get(i);
page.convertToImage();
BufferedImage image = page.convertToImage();
ImageIO.write(image, "jpg", new File(outputFolder.getAbsolutePath() + File.separator + (i + 1) + ".jpg"));
}
} finally {
if (doc != null) {
doc.close();
}
}
}
public static void main(String[] args) {
File inputFile = new File(ExtractionPDFtoThumbImgs.filePath);
File outputFolder = new File(ExtractionPDFtoThumbImgs.outputFilePath);
if(!outputFolder.exists()){
outputFolder.mkdirs();
}
try {
ExtractionPDFtoThumbImgs.change(inputFile, outputFolder);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Above code extract images in PDF page. not convert image in PDF page(included text).
Are there converting tools (PDF page to image) or Converting PDFBox class?
Please Suggest how to get image of a PDF page(included text). not to get images in a PDF page.
Try pdftocairo, it's part of poppler.
I was using imagemagick to convert PDF to images, but it relies on Ghostscript which is sometimes picky about the PDF you feed it so it was hit or miss...
So far pdftocairo has been solid.
http://poppler.freedesktop.org

Categories