I'm trying to print a post-processed (filled) PDF template that was created in LibreOffice and contains filled-out form fields.
The PDFBox SVN repository is nice and has a lot of examples of how to do this. Getting the PDF and its AcroForm is easy, and even editing and saving the modified PDF to disk works as expected. But this is not my goal. I want a PDF in which the fields are filled out and then removed, with only the text remaining.
I tried everything on Stack Overflow regarding PDFBox, from flattening the AcroForm to setting read-only properties on the fields and other meta info, installing the necessary fonts and much more. Every time I printed the PDF to a file, the text (edited and unedited) that was in a text field disappeared and the text fields were gone.
But then I tried to create a PDF from scratch with PDFBox, and printing works as expected. The text fields were in the generated template and the printed PDF file contained the text I wanted, with the corresponding form fields removed.
So I used the PDF Debugger from PDFBox to analyse the structure of the PDF and noticed that, in the debugger's preview, the PDF exported from LibreOffice does not show the text in the text field. BUT in the tree structure the PDF annotation is clearly there (/DV and /V) and looks quite similar to the PDFBox-created version, which works.
For testing I created a simple pdf with just one text field with name "test" and content "Foobar". Also the background and border color were changed to see if anything was successfully printed out.
PDDocument document = null;
try {
document = PDDocument.load(new File("<filepath>\\<filename>"));
} catch (final InvalidPasswordException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (final IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
PrintFields.createDummyPDF("<filepath>\\<filename>");
PrintFields.printFields(document); //debug output
//Getting pdf meta infos
final PDDocumentCatalog docCatalog = document.getDocumentCatalog();
final PDAcroForm acroForm = docCatalog.getAcroForm();
docCatalog.setAcroForm(acroForm);
//setting the appearance
final PDFont font = PDType1Font.HELVETICA;
final PDResources resources = new PDResources();
resources.put(COSName.getPDFName("Helv"), font);
acroForm.setDefaultResources(resources);
String defaultAppearanceString = "/Helv 0 Tf 0 g";
acroForm.setDefaultAppearance(defaultAppearanceString);
for(final PDField f : acroForm.getFields()) {
if(f instanceof PDTextField) {
defaultAppearanceString = "/Helv 12 Tf 0 0 1 rg";
final List<PDAnnotationWidget> widgets = ((PDTextField)f).getWidgets();
widgets.get(0).setAppearanceState(defaultAppearanceString);
}
}
for(final PDField f : acroForm.getFields()) {
f.setReadOnly(true);
}
// save modified pdf to file
document.save("<filepath>\\<filename>");
//print to file (to pdf)
if (job.printDialog()) {
try {
// Desktop.getDesktop().print();
job.print();
} catch (final PrinterException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
// copied from pdfbox examples
public static void createDummyPDF(final String path) throws IOException
{
// Create a new document with an empty page.
try (PDDocument document = new PDDocument())
{
final PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
// Adobe Acrobat uses Helvetica as a default font and
// stores that under the name '/Helv' in the resources dictionary
final PDFont font = PDType1Font.HELVETICA;
final PDResources resources = new PDResources();
resources.put(COSName.getPDFName("Helv"), font);
// Add a new AcroForm and add that to the document
final PDAcroForm acroForm = new PDAcroForm(document);
document.getDocumentCatalog().setAcroForm(acroForm);
// Add and set the resources and default appearance at the form level
acroForm.setDefaultResources(resources);
// Acrobat sets the font size on the form level to be
// auto sized as default. This is done by setting the font size to '0'
String defaultAppearanceString = "/Helv 0 Tf 0 g";
acroForm.setDefaultAppearance(defaultAppearanceString);
// Add a form field to the form.
final PDTextField textBox = new PDTextField(acroForm);
textBox.setPartialName("SampleField");
// Acrobat sets the font size to 12 as default
// This is done by setting the font size to '12' on the
// field level.
// The text color is set to blue in this example.
// To use black, replace "0 0 1 rg" with "0 0 0 rg" or "0 g".
defaultAppearanceString = "/Helv 12 Tf 0 0 1 rg";
textBox.setDefaultAppearance(defaultAppearanceString);
// add the field to the acroform
acroForm.getFields().add(textBox);
// Specify the widget annotation associated with the field
final PDAnnotationWidget widget = textBox.getWidgets().get(0);
final PDRectangle rect = new PDRectangle(50, 750, 200, 50);
widget.setRectangle(rect);
widget.setPage(page);
// set green border and yellow background
// if you prefer defaults, just delete this code block
final PDAppearanceCharacteristicsDictionary fieldAppearance
= new PDAppearanceCharacteristicsDictionary(new COSDictionary());
fieldAppearance.setBorderColour(new PDColor(new float[]{0,1,0}, PDDeviceRGB.INSTANCE));
fieldAppearance.setBackground(new PDColor(new float[]{1,1,0}, PDDeviceRGB.INSTANCE));
widget.setAppearanceCharacteristics(fieldAppearance);
// make sure the widget annotation is visible on screen and paper
widget.setPrinted(true);
// Add the widget annotation to the page
page.getAnnotations().add(widget);
// set the field value
textBox.setValue("Sample field");
document.save(path);
}
}
//copied from pdfbox examples
public static void processFields(final List<PDField> fields, final PDResources resources) {
fields.stream().forEach(f -> {
f.setReadOnly(true);
final COSDictionary cosObject = f.getCOSObject();
final String value = cosObject.getString(COSName.DV) == null ?
cosObject.getString(COSName.V) : cosObject.getString(COSName.DV);
System.out.println("Setting " + f.getFullyQualifiedName() + ": " + value);
try {
f.setValue(value);
} catch (final IOException e) {
if (e.getMessage().matches("Could not find font: /.*")) {
final String fontName = e.getMessage().replaceAll("^[^/]*/", "");
System.out.println("Adding fallback font for: " + fontName);
resources.put(COSName.getPDFName(fontName), PDType1Font.HELVETICA);
try {
f.setValue(value);
} catch (final IOException e1) {
e1.printStackTrace();
}
} else {
e.printStackTrace();
}
}
if (f instanceof PDNonTerminalField) {
processFields(((PDNonTerminalField) f).getChildren(), resources);
}
});
}
I would expect the PDFs generated by document.save() and job.print() to look identical in the viewer, but they do not.
If I take the PDF generated by document.save() with read-only disabled, I can use a PDF viewer like FoxitReader to fill the form and print it again. This produces the right output. In the job.print() version, the text contained in the (text) form field disappears.
Does anyone have a clue why this is the case?
I'm using PDFBox 2.0.13 (latest release) and LibreOffice 6.1.4.2.
Here are the referred files and here you can download the debugger (jar file, runnable with java -jar ).
My goal is to transfer textual content from a PDF to a new PDF while preserving the formatting of the font. (e.g. Bold, Italic, underlined..).
I am trying to use the TextPosition list from the existing PDF and write a new PDF from it.
For this, I take the font and font size of the current TextPosition entry and set them on a contentStream, then write the upcoming text with contentStream.showText().
After 137 successful loop iterations, this error occurs:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
TextPosition tpInfo = null;
String pdfFileInText = null;
int charIDindex = 0;
int pageIndex = 0;
try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
if (!pdfDocument.isEncrypted()) {
MyPdfTextStripper myStripper = new MyPdfTextStripper();
var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
createDirectory();
String newFileString = (srcErledigt + "Test.pdf");
File input = new File(newFileString);
input.createNewFile();
PDDocument document = new PDDocument();
// For Pages
for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
List<List<TextPosition>> pageList = pageIterator.next();
PDPage newPage = new PDPage();
document.addPage(newPage);
PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
contentStream.beginText();
pageIndex++;
// For Articles
for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
List<TextPosition> articleList = articleIterator.next();
// For Text
for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
tpCharID = charIDindex;
tpInfo = tpIterator.next();
System.out.println(tpCharID + ". charID: " + tpInfo);
PDFont tpFont = tpInfo.getFont();
float tpFontSize = tpInfo.getFontSize();
pdfFileInText = tpInfo.toString();
contentStream.setFont(tpFont, tpFontSize);
contentStream.newLineAtOffset(50, 700);
contentStream.showText(pdfFileInText);
charIDindex++;
}
}
contentStream.endText();
contentStream.close();
}
} else {
System.out.println("pdf Encrypted");
}
}
}
MyPdfTextStripper:
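import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class MyPdfTextStripper extends PDFTextStripper {
    public MyPdfTextStripper() throws IOException {
        super();
        setSortByPosition(true);
    }

    // Widen the visibility of the inherited accessor so callers can read the characters
    @Override
    public List<List<TextPosition>> getCharactersByArticle() {
        return super.getCharactersByArticle();
    }

    // Add Pages to CharactersByArticle List
    public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
        final int maxPageNr = doc.getNumberOfPages();
        List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
        for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
            // Restrict extraction to a single page, then snapshot its articles
            setStartPage(pageNr);
            setEndPage(pageNr);
            getText(doc);
            byPageList.add(List.copyOf(getCharactersByArticle()));
        }
        return byPageList;
    }
}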
Additional Info:
There are seven fonts in my document, all of which are set as subsets.
I need to write the Text given with the corresponding Font given.
All glyphs that should be written already exist in the original document, where I get my TextPositionList from.
All fonts are subtype 1 or 0
There is no AcroForm defined
Thanks in advance
Edit 30.08.2022:
Fixed the Issue by manually replacing this particular Unicode with a placeholder for the String before trying to write it.
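The replacement step described above could look roughly like the following. This is only a minimal sketch using plain JDK string handling; the class name SoftHyphenFilter, the method replaceSoftHyphens and the choice of "-" as placeholder are assumptions for illustration, not part of the original code. U+00AD is the soft hyphen named in the exception.

```java
public final class SoftHyphenFilter {

    // U+00AD SOFT HYPHEN, the code point the subset font has no glyph for
    private static final String SOFT_HYPHEN = "\u00AD";

    private SoftHyphenFilter() {
    }

    // Hypothetical helper: swaps every soft hyphen for a placeholder
    // before the text is handed to contentStream.showText(...)
    public static String replaceSoftHyphens(String text, String placeholder) {
        return text.replace(SOFT_HYPHEN, placeholder);
    }

    public static void main(String[] args) {
        String raw = "Wort\u00ADtrennung";
        System.out.println(replaceSoftHyphens(raw, "-"));
    }
}
```

The same pattern extends to other code points a subset font does not cover, at the cost of changing the text that ends up in the new document.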
Now I ran into this open ToDo:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
@Override
public byte[] encode(int unicode)
{
// todo: we can use a known character collection CMap for a CIDFont
// and an Encoding for Type 1-equivalent
throw new UnsupportedOperationException();
}
Anyone got any suggestions or Workarounds for this?
Edit 01.09.2022
I tried to replace occurrences of that Font with an alternative Font from the source file, but this opens another problem where a COSStream is "randomly" closed, which results in the new document not being able to save the File after writing my text with a contentStream.
Using standard fonts like PDType1Font.HELVETICA instead works, though.
I have been trying to make a fillable PDF file with LibreOffice Writer 7.2.2.2. Here is what the document looks like:
All fields right of the vertical lines are form textboxes, each one having its own name(tbxOrderId, tbxFullName...). Each textbox uses SF Pro Text Light as font. Only the one on the bottom right(tbxTotal) - Total €123.00 has Oswald Regular. The document looks alright when I fill these fields with LibreOffice Writer.
Below this are my export settings. I chose Archive PDF A-2b in order to embed the fonts into the document.
Here is the output when I run pdffonts to the exported PDF file.
However, when I run the following code which just changes the values of tbxOrderId and tbxTotal, the output PDF document is missing these fonts.
public class Start {
public static void main(String[] args) {
try {
PDDocument pDDocument = PDDocument.load(new File("/media/stoyank/Elements/Java/tmp/Receipt.pdf"));
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
PDField field = pDAcroForm.getField("tbxOrderId");
field.setValue("192753");
field = pDAcroForm.getField("tbxTotal");
field.setValue("Total: €192.00");
pDAcroForm.flatten();
pDDocument.save("/media/stoyank/Elements/Java/tmp/output.pdf");
pDDocument.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
This is what the output document looks like:
I tried to add the font manually by referring to this Stackoverflow question, but still no success:
PDDocument pDDocument = PDDocument.load(new File("/media/stoyank/Elements/Java/tmp/Receipt.pdf"));
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
InputStream font_file = ClassLoader.getSystemResourceAsStream("Oswald-Regular.ttf");
PDType0Font font = PDType0Font.load(pDDocument, font_file, false);
if (font_file != null) font_file.close();
PDResources resources = pDAcroForm.getDefaultResources();
if (resources == null) resources = new PDResources();
resources.put(COSName.getPDFName("Oswald-Regular"), font);
pDAcroForm.setDefaultResources(resources);
pDAcroForm.refreshAppearances();
PDField field = pDAcroForm.getField("tbxOrderId");
field.setValue("192753");
field = pDAcroForm.getField("tbxTotal");
field.setValue("Total: €192.00");
pDAcroForm.flatten();
pDDocument.save("/media/stoyank/Elements/Java/tmp/output.pdf");
pDDocument.close();
After I write into these textbox fields, I want to flatten the document.
Here is my folder structure:
System: Ubuntu 20.04
Also, here is a link to the ODT file that I then export to a PDF and the exported PDF.
Your file doesn't have correct appearance streams for the fields; this is a bug in the software that created the PDF. Call pDAcroForm.refreshAppearances(); as early as possible.
The code in pastebin is fine (it is based on CreateSimpleFormWithEmbeddedFont.java example), except that you should keep the default resources and not start with empty resources. So your code should look like this:
pDAcroForm.refreshAppearances();
PDType0Font formFont = PDType0Font.load(pDDocument, ...input stream..., false);
PDResources resources = pDAcroForm.getDefaultResources();
if (resources == null)
{
resources = new PDResources();
pDAcroForm.setDefaultResources(resources);
}
final String fontName = resources.add(formFont).getName();
// Acrobat sets the font size on the form level to be
// auto sized as default. This is done by setting the font size to '0'
String defaultAppearanceString = "/" + fontName + " 0 Tf 0 g";
PDTextField field = (PDTextField) (pDAcroForm.getField("tbxTotal"));
field.setDefaultAppearance(defaultAppearanceString);
field.setValue("Total: €192.00");
I am trying to save Arabic words in an editable PDF. It works fine with English words, but when I use Arabic ones, I get this exception:
java.lang.IllegalArgumentException:
U+0627 is not available in this font Helvetica encoding: WinAnsiEncoding
Here is how I generated PDF:
public static void main(String[] args) throws IOException
{
String formTemplate = "myFormPdf.pdf";
try (PDDocument pdfDocument = PDDocument.load(new File(formTemplate)))
{
PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
PDTextField field = (PDTextField) acroForm.getField( "sampleField" );
field.setValue("جملة");
}
pdfDocument.save("updatedPdf.pdf");
}
}
This is how I made it work; I hope it helps others. Just use a font that supports the language you want to use in the PDF.
public static void main(String[] args) throws IOException
{
    String formTemplate = "myFormPdf.pdf";
    try (PDDocument pdfDocument = PDDocument.load(new File(formTemplate)))
    {
        PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
        if (acroForm != null)
        {
            // you can read ttf from resources as well, this is just for testing
            PDFont font = PDType0Font.load(pdfDocument, new File("/path/to/font.ttf"));
            // register the font in the form's default resources and keep the generated name
            String fontName = acroForm.getDefaultResources().add(font).getName();
            PDTextField field = (PDTextField) acroForm.getField("sampleField");
            field.setDefaultAppearance("/" + fontName + " 0 Tf 0 g");
            field.setValue("جملة");
        }
        pdfDocument.save("updatedPdf.pdf");
    }
}
Edited: Adding the comment of mkl
The font name and the font size are parameters of the Tf instruction, and the gray value 0 for black is the parameter for the g instruction. Parameters and instruction names must be appropriately separated.
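To make the separation rule concrete, here is a small sketch of a helper that assembles such a default appearance string with the required spaces between operands and operators. The class DefaultAppearance and its build method are hypothetical names introduced for illustration; only the resulting string format ("/name size Tf gray g") comes from the code above.

```java
public final class DefaultAppearance {

    private DefaultAppearance() {
    }

    // Builds "/<font> <size> Tf <gray> g": operands precede their operator,
    // and every token is separated by a single space
    public static String build(String fontResourceName, float fontSize, float gray) {
        return "/" + fontResourceName + " " + number(fontSize) + " Tf " + number(gray) + " g";
    }

    // Prints whole numbers without a fractional part, e.g. 12 instead of 12.0
    private static String number(float value) {
        if (value == Math.rint(value)) {
            return String.valueOf((long) value);
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        // Font size 0 means auto-sized text, gray 0 means black
        System.out.println(build("F1", 0f, 0f));
    }
}
```

The returned string could then be passed to field.setDefaultAppearance(...), with the font resource name taken from resources.add(font).getName() as in the answer.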
You need a font which supports those Arabic symbols.
Once you've got a compatible font, you can load it using PDType0Font
final PDFont font = PDType0Font.load(...);
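A Type 0 font is a composite font: it wraps a descendant CIDFont and uses a multi-byte encoding, so it can address far more glyphs than a simple 8-bit font such as the standard Helvetica, including the Arabic ones if the loaded font file contains them.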
See also the Cookbook - working with fonts (no examples with Type 0, but still useful).
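As a rough pre-check before calling setValue, one could test whether a value is likely to fit a WinAnsi-encoded font at all. The sketch below uses only the JDK and treats the windows-1252 charset as an approximation of PDF's WinAnsiEncoding; that equivalence is an assumption (the two are close but not guaranteed identical), and the class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public final class WinAnsiPrecheck {

    private WinAnsiPrecheck() {
    }

    // Approximation: windows-1252 stands in for WinAnsiEncoding here
    public static boolean fitsWinAnsi(String value) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return encoder.canEncode(value);
    }

    public static void main(String[] args) {
        // Latin text can be encoded
        System.out.println(fitsWinAnsi("sample"));
        // Arabic letters (here the word from the question, written with escapes) cannot
        System.out.println(fitsWinAnsi("\u062C\u0645\u0644\u0629"));
    }
}
```

If the check fails, that is a hint to switch the field to an embedded Type 0 font before setting the value, rather than relying on the exception.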
What I am trying to do here is to create text and place it onto a blank page. That page would then be overlaid onto another document, and the result would be saved as one document. In 1.8 I was able to create a blank PDPage in a PDF, write text to it as needed, then overlay that PDF with another and save it or view it on screen using the code below:
overlayDoc = new PDDocument();
page = new PDPage();
overlayDoc.addPage(page);
overlayObj = new Overlay();
font = PDType1Font.COURIER_OBLIQUE;
try {
contentStream = new PDPageContentStream(overlayDoc, page);
contentStream.setFont(font, 10);
}
catch (Exception e){
System.out.println("content stream failed");
}
After I created the stream, when I needed to write something to the overlay document's contentStream, I would call this method, give it my x, y coords and tell it what text to write (again, this is in my 1.8 version):
protected void writeString(int x, int y, String text) {
if (text == null) return;
try {
contentStream.moveTo(x, y);
contentStream.beginText();
contentStream.drawString(text); // deprecated. Use showText(String text)
contentStream.endText();
}
catch (Exception e){
System.out.println(text + " failed. " + e.toString());
}
}
I would call this method whenever I needed to add text and to wherever I needed to do so. After this, I would close my content stream and then merge the documents together as such:
import org.apache.pdfbox.Overlay;
Overlay overlayObj = new Overlay();
....
PDDocument finalDoc = overlayObj.overlay(overlayDoc, originalDoc);
finalDoc now contains a PDDocument which is my original PDF with text overlaid where needed. I could save it and view it as a BufferedImage on the desktop. I moved to 2.0 because I needed to stay on the most recent library version and because I was having issues putting an image onto the page (see here).
The issue I am having in this question is that 2.0 no longer has something similar to the org.apache.pdfbox.Overlay class. What confuses me even more is that there are two Overlay classes in 1.8 (org.apache.pdfbox.Overlay and org.apache.pdfbox.util.Overlay), whereas in 2.0 there is only one. The class I need (org.apache.pdfbox.Overlay), or at least the methods it offers, is not present in 2.0 as far as I can tell. I can only find org.apache.pdfbox.multipdf.Overlay.
Here's some quick code that works, it adds "deprecated" over a document and saves it elsewhere:
PDDocument overlayDoc = new PDDocument();
PDPage page = new PDPage();
overlayDoc.addPage(page);
Overlay overlayObj = new Overlay();
PDFont font = PDType1Font.COURIER_OBLIQUE;
PDPageContentStream contentStream = new PDPageContentStream(overlayDoc, page);
contentStream.setFont(font, 50);
contentStream.setNonStrokingColor(0);
contentStream.beginText();
contentStream.newLineAtOffset(200, 200); // replaces the deprecated moveTextPositionByAmount
contentStream.showText("deprecated"); // replaces the deprecated drawString
contentStream.endText();
contentStream.close();
PDDocument originalDoc = PDDocument.load(new File("...inputfile.pdf"));
overlayObj.setOverlayPosition(Overlay.Position.FOREGROUND);
overlayObj.setInputPDF(originalDoc);
overlayObj.setAllPagesOverlayPDF(overlayDoc);
Map<Integer, String> ovmap = new HashMap<Integer, String>(); // empty map is a dummy
overlayObj.setOutputFile("... result-with-overlay.pdf");
overlayObj.overlay(ovmap);
overlayDoc.close();
originalDoc.close();
What I did additionally to your version:
declare variables
close the content stream
set a color
set to foreground
set a text position (not a stroke path position)
add an empty map
And of course, I read the OverlayPDF source code; it shows more of what you can do with the class.
Bonus content:
Do the same without using the Overlay class, which allows further manipulation of the document before saving it.
PDFont font = PDType1Font.COURIER_OBLIQUE;
PDDocument originalDoc = PDDocument.load(new File("...inputfile.pdf"));
PDPage page1 = originalDoc.getPage(0);
PDPageContentStream contentStream = new PDPageContentStream(originalDoc, page1, true, true, true);
contentStream.setFont(font, 50);
contentStream.setNonStrokingColor(0);
contentStream.beginText();
contentStream.newLineAtOffset(200, 200); // replaces the deprecated moveTextPositionByAmount
contentStream.showText("deprecated"); // replaces the deprecated drawString
contentStream.endText();
contentStream.close();
originalDoc.save("....result2.pdf");
originalDoc.close();
How can we extract the text content of a PDF file? We are using PDFBox to extract the text, but the output also includes the header and footer, which we do not need. I am using the following Java code.
PDFTextStripper stripper = null;
try {
stripper = new PDFTextStripper();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
stripper.setStartPage(pageCount);
stripper.setEndPage(pageCount);
try {
String pageText = stripper.getText(document);
System.out.println(pageText);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
You have tagged this as an itext/itextpdf question, yet you are using PdfBox. That's confusing.
You also claim that your PDF file has headers and footers. This would imply that your PDF is a Tagged PDF and that the header and the footer are marked as artifacts. If that is the case, then you should take advantage of the Tagged nature of the PDF and extract the content as is done in the ParseTaggedPdf example:
TaggedPdfReaderTool readertool = new TaggedPdfReaderTool();
PdfReader reader = new PdfReader(StructuredContent.RESULT);
readertool.convertToXml(reader, new FileOutputStream(RESULT));
reader.close();
If this doesn't result in anything, you clearly don't have a Tagged PDF in which case there are no headers and footers in your document from a technical point of view. You may see headers and footers with your human eyes, but that doesn't mean that a machine sees these headers and footers. To a machine, it's just text like any other text in the page.
The ExtractPageContentArea example shows how we can define a rectangle that excludes the header and the footer when parsing for the content.
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();
In this case, we have examined the document manually and noticed that the actual text is always added inside the rectangle new Rectangle(70, 80, 490, 580). The header is added above Y coordinate 580 and the footer below Y coordinate 80. By using the RegionTextRenderFilter, we can extract the content while excluding everything that doesn't overlap with the rectangle we have defined.
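The idea behind the region filter can be sketched independently of any PDF library: keep only the pieces of text whose bounding box intersects the content rectangle. The following is a simplified illustration using java.awt.geom, not the iText implementation; the Chunk class, fromCorners and textInside are hypothetical names. Note that the four numbers in the example above are treated here as lower-left and upper-right corners, whereas Rectangle2D expects x, y, width and height.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public final class RegionFilterSketch {

    // Hypothetical stand-in for a positioned piece of text on a page
    public static final class Chunk {
        final String text;
        final Rectangle2D box;

        public Chunk(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    private RegionFilterSketch() {
    }

    // Converts corner coordinates (llx, lly, urx, ury) to an x/y/width/height rectangle
    public static Rectangle2D fromCorners(double llx, double lly, double urx, double ury) {
        return new Rectangle2D.Double(llx, lly, urx - llx, ury - lly);
    }

    // Keeps the text of every chunk whose box overlaps the region
    public static List<String> textInside(List<Chunk> chunks, Rectangle2D region) {
        List<String> kept = new ArrayList<>();
        for (Chunk chunk : chunks) {
            if (region.intersects(chunk.box)) {
                kept.add(chunk.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Rectangle2D region = fromCorners(70, 80, 490, 580);
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("header", new Rectangle2D.Double(70, 600, 100, 12)));
        chunks.add(new Chunk("body", new Rectangle2D.Double(70, 300, 100, 12)));
        chunks.add(new Chunk("footer", new Rectangle2D.Double(70, 40, 100, 12)));
        // Only the chunk that lies inside the content area survives
        System.out.println(textInside(chunks, region));
    }
}
```

The rectangle itself still has to be found by inspecting the document, as described above; the filter only applies it.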