PDFBox create oversized pages - java

When I open a PDF created by PDFBox with PAGE_SIZE_A4 in Adobe to print it, the print dialog says it will "Shrink oversized pages" to 96%. Even if I decrease the page size myself, it shows "Shrink oversized pages" at 100%.
I know this may be a duplicate of: How to set Page Scaling option in Apache PDfBox. But that question is already 2 years old.
Using: latest PDFBox 1.8.9
My example code:
PDPage page = new PDPage(PDPage.PAGE_SIZE_A4); // new PDPage(595.27563f, 841.8898);
document.addPage(page);
PDPageContentStream cs = new PDPageContentStream(document, page);
/* With or without content */
cs.close();
document.save(pdfFile);
document.close();
The workaround with images is not an option.
iText is not an option.
Thank you.
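For reference, the two numbers in the commented-out constructor above are ISO A4 (210 mm x 297 mm) expressed in PDF user-space units of 1/72 inch. The following is a minimal sketch of that arithmetic using only the JDK; the class and method names are hypothetical and not part of PDFBox.

```java
import java.util.Locale;

/** Converts paper dimensions from millimetres to PDF user-space units (1/72 inch). */
final class PageSizeSketch {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Hypothetical helper: millimetres to points. */
    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // ISO A4 is 210 mm x 297 mm.
        System.out.printf(Locale.ROOT, "A4: %.2f x %.2f pt%n",
                mmToPoints(210), mmToPoints(297));
        // prints "A4: 595.28 x 841.89 pt"
    }
}
```

This only shows where 595.27563 and 841.8898 come from; it makes no claim about why the print dialog offers to shrink the page.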

Related

Add HTML Markup using java Apache PDFBOX

I have been using PDFBox and EasyTable, which extends PDFBox to draw data tables. I have hit a problem: I have a Java object with a string of HTML data that I need to add to the PDF using PDFBox. Digging through the documentation has not borne any fruit.
The code below is a hello-world snippet; I want the text in the generated PDF to have H1 formatting.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "<h1>HelloWorld</h1>" );
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
Use Jericho to convert the HTML to plain text while correctly mapping the output of the tags.
Sample:
public String extractAllText(String htmlText){
return new net.htmlparser.jericho
.Source(htmlText)
.getRenderer()
.setMaxLineLength(Integer.MAX_VALUE)
.setNewLine(null)
.toString();
}
Include it in your Gradle or Maven build (Gradle syntax shown):
compile group: 'net.htmlparser.jericho', name: 'jericho-html', version: '3.4'
PDFBox does not know HTML, at least not for creating content.
Thus, with plain PDFBox you have to parse the HTML yourself and derive the text drawing characteristics from the tags the text is in.
E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the information that it is in an h1 tag to select an appropriate top-level header font and font size to draw that "HelloWorld".
Alternatively you can look for a library doing that HTML parsing and transforming to PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
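The do-it-yourself approach could be sketched roughly as follows, using only the JDK. This is a minimal sketch under strong assumptions: it handles a single simple element without attributes or nesting, and the class name, the StyledText holder and the tag-to-size table are all hypothetical choices, not anything PDFBox provides.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeadingStyleSketch {
    // Hypothetical tag-to-size table; pick values that suit your layout.
    private static final Map<String, Float> FONT_SIZES = new LinkedHashMap<>();
    static {
        FONT_SIZES.put("h1", 24f);
        FONT_SIZES.put("h2", 18f);
        FONT_SIZES.put("p", 12f);
    }

    // Matches one simple element such as <h1>HelloWorld</h1> (no attributes, no nesting).
    private static final Pattern ELEMENT = Pattern.compile("<(\\w+)>([^<]*)</\\1>");

    /** Holds the plain text of an element and the font size chosen for its tag. */
    static final class StyledText {
        final String text;
        final float fontSize;

        StyledText(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Extracts the text and derives a font size from the enclosing tag. */
    static StyledText parse(String html) {
        Matcher m = ELEMENT.matcher(html.trim());
        if (!m.matches()) {
            // Not a simple element: treat the whole input as body text.
            return new StyledText(html, 12f);
        }
        String tag = m.group(1).toLowerCase(Locale.ROOT);
        return new StyledText(m.group(2), FONT_SIZES.getOrDefault(tag, 12f));
    }
}
```

In the question's snippet, the result would then feed the existing calls, e.g. contentStream.setFont(font, styled.fontSize) followed by contentStream.drawString(styled.text). Real HTML needs a proper parser rather than a regular expression, which is why a library is usually the better choice.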

Adding pages to PDF/A file with PDFBox without losing PDF/A validity

I'm developing a Java application that has to process a folder with PDF/A files, adding a page with some information to each of them using Apache's PDFBox library. The problem is that the output PDF file after adding the information is not PDF/A anymore. This is a validation test from the website: https://www.pdf-online.com/osa/validate.aspx:
And this is the relevant part of the code that I use to generate the PDF file:
String pdfFileName = this.baseFolder+this.extendedPDFFileName;
File file = new File(pdfFileName);
PDDocument pdfFile = PDDocument.load(file);
PDPage pag = new PDPage();
// As a test, simply adding a page makes the PDF invalid as PDF/A
pdfFile.addPage(pag);
pdfFile.save(file);
pdfFile.close();
What could I do to keep the PDF/A format validity? Thanks in advance,
As Tilman Hausherr suggested, the problem has been solved by adding a PDResources object to the new page, like this:
pag.setResources(new PDResources());
Now I'm having trouble with the embedded fonts, but that is another question :)
Many thanks!
Your code creates a normal PDF; you should create a valid PDF/A from the start.
Here's a link: https://pdfbox.apache.org/1.8/cookbook/pdfacreation.html

iText PDF add text in absolute position on top of the 1st page

I have a script that creates a PDF file and writes content to it. After the execution is complete I need to write the status (fail, success) to the PDF, but the status should be at the top of the page. So the solution I came up with is to use absolutely positioned text. Below is my code:
PdfContentByte cb = writer.DirectContent;
BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
cb.SaveState();
cb.BeginText();
cb.MoveText(700, 30);
cb.SetFontAndSize(bf, 12);
cb.ShowText("My status");
cb.EndText();
cb.RestoreState();
But since the PDF spans multiple pages, this is added to the last page of the PDF. How can I add it to the first page?
Also, is there a way to calculate the top coordinates of the page, i.e. the top-left coordinate?
iText was written with internet applications in mind. It was designed to flush content from memory as soon as possible: if a page is finished, that page is sent to the OutputStream and there is no way to return to that page.
That doesn't mean your requirement is impossible. PDF has a concept known as Form XObject. In iText, this concept is implemented under the name PdfTemplate. Such a PdfTemplate is a rectangular canvas with a fixed size that can be added to a page without being part of that page.
An example should clarify what that means. Please take a look at the WriteOnFirstPage example. In this example, we create a PdfTemplate like this:
PdfContentByte cb = writer.getDirectContent();
PdfTemplate message = cb.createTemplate(523, 50);
This message object refers to a Form XObject. It is a piece of content that is external to the page content.
We wrap the PdfTemplate inside an Image object. By doing so, we can add the Form XObject to the document just like any other object:
Image header = Image.getInstance(message);
document.add(header);
Now we can add as much data as we want:
for (int i = 0; i < 100; i++) {
document.add(new Paragraph("test"));
}
Adding 100 "test" lines will cause iText to create 3 pages. Once we're on page 3, we no longer have access to page 1, but we can still write content to the message object:
ColumnText ct = new ColumnText(message);
ct.setSimpleColumn(new Rectangle(0, 0, 523, 50));
ct.addElement(
new Paragraph(
String.format("There are %s pages in this document", writer.getPageNumber())));
ct.go();
If you check the resulting PDF write_on_first_page.pdf, you'll notice that the text we've added last is indeed on the first page.

Editing an existing PDF without using iText

I want to add an index page to an existing PDF file, and add page numbers to the pages of that file.
All the suggested solutions point towards creating a new PDF and merging the existing PDF file with the new one.
Is there any other way to do this?
Also, I don't want to use iText since it's not free for commercial use.
According to your comments on the original question, you think that in PDFBox
"for adding a new page & content, you need to create a new pdf add new content and then merge the existing pdf. I wanted to avoid the merging step. Renaming is not the issue"
You might want to try something like this:
PDDocument doc = PDDocument.load(new FileInputStream(new File("original.pdf")));
PDPage page = new PDPage();
// fill page
doc.addPage(page);
doc.save("original-plus-page.pdf");
EDIT: In a comment on the answer, the question arose how to insert a new page at a specific index (page number). To do this, the doc.addPage(page) call obviously has to be changed somehow.
Originally this PDDocument method is defined like this:
/**
* This will add a page to the document. This is a convenience method, that
* will add the page to the root of the hierarchy and set the parent of the
* page to the root.
*
* @param page The page to add to the document.
*/
public void addPage( PDPage page )
{
PDPageNode rootPages = getDocumentCatalog().getPages();
rootPages.getKids().add( page );
page.setParent( rootPages );
rootPages.updateCount();
}
We just need a similar function that, instead of appending the page to the kids, inserts it at a given index. A helper method like the following in our code will do:
public static void addPage(PDDocument doc, int index, PDPage page)
{
PDPageNode rootPages = doc.getDocumentCatalog().getPages();
rootPages.getKids().add(index, page);
page.setParent(rootPages);
rootPages.updateCount();
}
If you now replace the line
doc.addPage(page);
in the code of the original answer by
addPage(doc, 0, page);
the empty page is added up front.
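The effect of the indexed insert can be illustrated without PDFBox. The sketch below assumes that the list returned by getKids() behaves like an ordinary java.util.List, as the add(index, page) call in the helper suggests; the class name and the use of strings as stand-ins for pages are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InsertAtIndexSketch {
    /** Returns a copy of kids with page inserted at index; later entries shift right. */
    static List<String> insert(List<String> kids, int index, String page) {
        List<String> copy = new ArrayList<>(kids);
        // Same List.add(int, E) call that the helper applies to the kids list.
        copy.add(index, page);
        return copy;
    }

    public static void main(String[] args) {
        List<String> kids = Arrays.asList("page1", "page2", "page3");
        System.out.println(insert(kids, 0, "index"));
        // prints "[index, page1, page2, page3]"
    }
}
```

Index 0 puts the new page first, and an index equal to the current size appends it, which matches what the unmodified addPage does. Any other out-of-range index throws IndexOutOfBoundsException, so the caller should check the page count first.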

PDFBox change page sizes and save it again

First of all, sorry for my bad English.
I'm trying to remove the header and footer of a PDF page. I need to search for some words across the page break, but that is impossible with the header and footer in the way, so I need to crop them out and then convert the page to text; then it is "possible" to search for the words.
This is what I'm doing:
PDDocument pdDoc = PDDocument.load("document.pdf");
PDPage page = (PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0);
PDRectangle rectangle = new PDRectangle();
rectangle.setUpperRightY(page.findCropBox().getUpperRightY() - 100);
rectangle.setLowerLeftY(page.findCropBox().getLowerLeftY());
rectangle.setUpperRightX(page.findCropBox().getUpperRightX());
rectangle.setLowerLeftX(page.findCropBox().getLowerLeftX());
page.setMediaBox(rectangle);
PDDocument document = new PDDocument();
document.addPage(page);
document.save("newDocument.pdf");
document.close();
But when I convert it to HTML, it still keeps the text that was hidden. Is there any way to save it without the cropped area and convert it to HTML correctly?
Thanks.
Best regards.
