Add HTML Markup using java Apache PDFBOX

I have been using PDFBox and EasyTable, which extends PDFBox to draw data tables. I have hit a problem: I have a Java object with a string of HTML data that I need to add to the PDF using PDFBox. Digging through the documentation has not borne any fruit.
The code below is a hello-world snippet; I want the text in the generated PDF to have H1 formatting.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "<h1>HelloWorld</h1>" );
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();

Use Jericho to render the HTML as plain text while correctly mapping the output of the tags.
Sample:
public String extractAllText(String htmlText) {
    return new net.htmlparser.jericho.Source(htmlText)
            .getRenderer()
            .setMaxLineLength(Integer.MAX_VALUE)
            .setNewLine(null)
            .toString();
}
Add the dependency to your Gradle build (or use the same coordinates in Maven):
compile group: 'net.htmlparser.jericho', name: 'jericho-html', version: '3.4'

PDFBox does not know HTML, at least not for creating content.
Thus, with plain PDFBox you have to parse the HTML yourself and derive the text drawing characteristics from the tags the text is in.
E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the information that it is in an h1 tag to select an appropriate top-level header font and font size to draw that "HelloWorld".
Alternatively, you can look for a library that does the HTML parsing and transforms it into PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
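The "parse it yourself" route can be sketched roughly as follows. This is only a minimal illustration using the JDK, not a real HTML parser: the class name HtmlRunSketch, the StyledRun type, the set of handled tags and the font sizes per tag are all assumptions made for the example, and nested or malformed markup is not handled properly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns flat HTML into (text, font size, bold) runs.
final class HtmlRunSketch {

    /** Text extracted from one element, plus the style chosen for its tag. */
    static final class StyledRun {
        final String text;
        final float fontSize;
        final boolean bold;

        StyledRun(String text, float fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    // Matches a flat <tag>...</tag> pair for the few tags this sketch knows.
    private static final Pattern ELEMENT = Pattern.compile(
            "<(h1|h2|h3|p|b)>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<StyledRun> parse(String html) {
        List<StyledRun> runs = new ArrayList<>();
        Matcher m = ELEMENT.matcher(html);
        while (m.find()) {
            String tag = m.group(1).toLowerCase(Locale.ROOT);
            // Drop any tags nested inside the element and keep only the text.
            String text = m.group(2).replaceAll("<[^>]*>", "").trim();
            runs.add(new StyledRun(text, fontSizeFor(tag), isBold(tag)));
        }
        return runs;
    }

    // Assumed sizes; pick whatever fits your layout.
    static float fontSizeFor(String tag) {
        switch (tag) {
            case "h1": return 24f;
            case "h2": return 18f;
            case "h3": return 14f;
            default:   return 12f;
        }
    }

    static boolean isBold(String tag) {
        return tag.startsWith("h") || tag.equals("b");
    }
}
```

For each run you would then call setFont with a bold or regular font and the run's fontSize, and draw run.text instead of the raw HTML string, as in the question's content stream code.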

Related

Setting substitution fonts in AcroForm itext7

I have a pdf with AcroForm and need to fill it with string that contains different languages glyphs (English, Chinese, Korean, Khmer).
In iText5 I've used:
AcroFields form = stamper.getAcroFields();
form.addSubstitutionFont(arialFont);
form.addSubstitutionFont(khmerFont);
And it worked fine for Chinese and Korean, but I've faced an issue with Khmer ligatures not being rendered. Found out that I need pdfCalligraph addon to make ligatures work, but it comes with iText7 only. I've managed to add paragraphs with proper Khmer ligatures rendering (requiring typography as a dependency and loading a license key). But in AcroForm it won't do it automatically. I'm struggling to find an iText7 version of addSubstitutionFont and make it work with pdfCalligraph.
Code I've used with iText7:
PdfReader reader = new PdfReader(templatePath);
PdfDocument pdf = new PdfDocument(reader, new PdfWriter(outputPath));
Document document = new Document(pdf);
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
PdfFont khmerFont = PdfFontFactory.createFont(pathToKhmerFont, PdfEncodings.IDENTITY_H, true);
PdfFont font = PdfFontFactory.createFont(pathToArialUnicodeFont, PdfEncodings.IDENTITY_H, true);
pdf.addFont(khmerFont);
pdf.addFont(font);
FontSet set = new FontSet();
set.addFont(pathToKhmerFont);
set.addFont(pathToArialUnicodeFont);
document.setFontProvider(new FontProvider(set));
document.setProperty(Property.FONT, "Arial");
form.setNeedAppearances(true);
String content = "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일";
PdfFormField tf = form.getField("Text3");
tf.setValue(content);
// tf.setFont(khmerFont);
tf.regenerateField();
// add a paragraph just to check pdfCalligraph works
document.add(new Paragraph(content));
pdf.close();
String used to test proper rendering: "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일"
iText5 in form field without pdfCalligraph, but with substitution fonts:
iText7 in form field with pdfCalligraph loaded (set arial font field.setFont(arialFont)):
iText7 in form field with pdfCalligraph loaded (set khmer font field.setFont(khmerFont)):
iText7, same document, but in a paragraph instead of a form field, with pdfCalligraph loaded (this is the expected result, so pdfCalligraph is used for paragraphs, but not for form fields):
So, as you can see, there are basically 2 issues:
1. How do I addSubstitutionFont in iText7?
2. How do I use pdfCalligraph in a PdfFormField appearance?
I've also checked whether pdfCalligraph works in a text form field, and it looks like it does not. Here is the code I used to check it:
LicenseKey.loadLicenseFile(path_to_license);
String outputPath = path_to_output_doc;
PdfDocument pdf = new PdfDocument(new PdfWriter(outputPath));
Document document = new Document(pdf);
// prepare fonts for pdfCalligraph to use
FontSet set = new FontSet();
set.addFont("/path_to/Khmer.ttf");
set.addFont("/path_to/ArialUnicodeMS.ttf");
FontProvider fontProvider = new FontProvider(set);
document.setFontProvider(fontProvider);
document.setProperty(Property.FONT, "Arial");
String content = "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일";
// Add a paragraph to check if pdfCalligraph works
document.add(new Paragraph(content));
// Add a form text field
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
PdfTextFormField field = PdfFormField.createText(pdf, new Rectangle(36, 700, 400, 30), "test");
field.setValue(content);
form.addField(field);
document.close();
Output with the pdfCalligraph dependency loaded (as you can see, the paragraph is rendered properly, but in the form all non-Helvetica characters are simply ignored):
Output without the pdfCalligraph dependency loaded (as you can see, the paragraph is not rendered properly, which is expected; the form field looks the same as with pdfCalligraph loaded):
Am I missing something?

iText PDF add text in absolute position on top of the 1st page

I have a script that creates a PDF file and writes contents to it. After the execution is complete I need to write the status (fail, success) to the PDF, but the status should be at the top of the page. The solution I came up with is to use absolutely positioned text. Below is my code:
PdfContentByte cb = writer.DirectContent;
BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
cb.SaveState();
cb.BeginText();
cb.MoveText(700, 30);
cb.SetFontAndSize(bf, 12);
cb.ShowText("My status");
cb.EndText();
cb.RestoreState();
But as the PDF has multiple pages, this is added to the last page of the PDF. How can I add it to the first page?
Also, is there a way to calculate the top coordinates of the page, i.e. the top-left coordinate?

iText was written with internet applications in mind. It was designed to flush content from memory as soon as possible: if a page is finished, that page is sent to the OutputStream and there is no way to return to that page.
That doesn't mean your requirement is impossible. PDF has a concept known as Form XObject. In iText, this concept is implemented under the name PdfTemplate. Such a PdfTemplate is a rectangular canvas with a fixed size that can be added to a page without being part of that page.
An example should clarify what that means. Please take a look at the WriteOnFirstPage example. In this example, we create a PdfTemplate like this:
PdfContentByte cb = writer.getDirectContent();
PdfTemplate message = cb.createTemplate(523, 50);
This message object refers to a Form XObject. It is a piece of content that is external to the page content.
We wrap the PdfTemplate inside an Image object. By doing so, we can add the Form XObject to the document just like any other object:
Image header = Image.getInstance(message);
document.add(header);
Now we can add as much data as we want:
for (int i = 0; i < 100; i++) {
    document.add(new Paragraph("test"));
}
Adding 100 "test" lines will cause iText to create 3 pages. Once we're on page 3, we no longer have access to page 1, but we can still write content to the message object:
ColumnText ct = new ColumnText(message);
ct.setSimpleColumn(new Rectangle(0, 0, 523, 50));
ct.addElement(
    new Paragraph(
        String.format("There are %s pages in this document", writer.getPageNumber())));
ct.go();
If you check the resulting PDF write_on_first_page.pdf, you'll notice that the text we've added last is indeed on the first page.

PDFBox create oversized pages

When I open my PDF, created by PDFBox with PAGE_SIZE_A4, in Adobe to print it, the print dialog says that "Shrink oversized pages" will scale it to 96%. Even if I decrease the page size myself, it shows "Shrink oversized pages" at 100%.
I know this may be a duplicate of: How to set Page Scaling option in Apache PDfBox. But that question is already 2 years old.
Using: latest PDFBox, 1.8.9
My example code:
PDPage page = new PDPage(PDPage.PAGE_SIZE_A4); // new PDPage(595.27563f, 841.8898);
document.addPage(page);
PDPageContentStream cs = new PDPageContentStream(document, page);
/* With or without content */
cs.close();
document.save(pdfFile);
document.close();
The workaround with images is not an option.
iText is not an option.
Thank you.
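For reference, the two numbers in the commented-out constructor above are simply the A4 dimensions (210 mm by 297 mm) expressed in PDF units of 1/72 inch, i.e. about 595.28 by 841.89 points. A minimal sketch of that conversion; the class and method names are made up for illustration and are not part of PDFBox:

```java
// Hypothetical helper, not a PDFBox class: converts millimetres to PDF points.
final class PageUnitsSketch {
    static final double POINTS_PER_INCH = 72.0; // default PDF user-space unit
    static final double MM_PER_INCH = 25.4;

    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    static double a4WidthPoints() {
        return mmToPoints(210);
    }

    static double a4HeightPoints() {
        return mmToPoints(297);
    }
}
```

So a page created with PAGE_SIZE_A4 should have exactly A4 dimensions; the sketch only shows where the values come from and does not by itself explain the scaling the print dialog proposes.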

How I use a custom font with the Batik SVG library?

I'm working on a Java program which creates templates for clothes. The user enters the word they want to see on the item of clothing and the system creates a PDF template. To create the template I create an SVG document programmatically, then use Batik to transcode the SVG to the PDF format.
My client now wants to be able to use custom fonts to create the templates. I was wondering if it's possible to use fonts like a TTF with the Batik transcoder? If so how do you go about setting up the SVG?

First you need to convert the font file from TTF to SVG with Batik's ttf2svg. Once you have the converted file, you have to add a reference to it in the 'defs' section of your SVG document.
This is how I did it:
Element defs = doc.createElementNS(svgNS, "defs");
Element fontface = doc.createElementNS(svgNS, "font-face");
fontface.setAttributeNS(null, "font-family", "DroidSansRegular");
Element fontfacesrc = doc.createElementNS(svgNS, "font-face-src");
Element fontfaceuri = doc.createElementNS(svgNS, "font-face-uri");
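// xlink:href belongs to the XLink namespace, not the SVG namespace
fontfaceuri.setAttributeNS("http://www.w3.org/1999/xlink", "xlink:href", "fonts/DroidSans-webfont.svg#DroidSansRegular");
Element fontfaceformat = doc.createElementNS(svgNS, "font-face-format");
// plain SVG attributes are in no namespace
fontfaceformat.setAttributeNS(null, "string", "svg");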
fontfaceuri.appendChild(fontfaceformat);
fontfacesrc.appendChild(fontfaceuri);
fontface.appendChild(fontfacesrc);
defs.appendChild(fontface);
svgRoot.appendChild(defs);
When creating the text element, set the font family just like for any other font:
Element txtElem = doc.createElementNS(svgNS, "text");
txtElem.setAttributeNS(null, "style", "font-family:DroidSansRegular;font-size:" + fontsize + ";stroke:#000000;fill:#00ff00;");
txtElem.setTextContent("some text");
svgRoot.appendChild(txtElem);
I got the info from this article: http://frabru.de/c.php/article/SVGFonts-usage. Hope it helps.
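The snippets above assume that doc, svgNS and svgRoot already exist. A self-contained sketch of the same structure, using only the JDK's DOM classes, could look like the following. The class and method names are made up for the example, and with Batik you would normally create the document through Batik's own SVG DOM implementation instead. Note that this sketch puts xlink:href in the XLink namespace and leaves plain attributes such as style un-namespaced, which is how SVG expects them.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper that builds an SVG DOM referencing an SVG font file.
final class SvgFontFaceSketch {
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    static Document build(String family, String fontUri, String text) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No namespace-aware DOM builder available", e);
        }

        Element svgRoot = doc.createElementNS(SVG_NS, "svg");
        doc.appendChild(svgRoot);

        // <defs><font-face><font-face-src><font-face-uri><font-face-format/>
        Element fontFace = doc.createElementNS(SVG_NS, "font-face");
        fontFace.setAttributeNS(null, "font-family", family);

        Element fontFaceUri = doc.createElementNS(SVG_NS, "font-face-uri");
        fontFaceUri.setAttributeNS(XLINK_NS, "xlink:href", fontUri);

        Element fontFaceFormat = doc.createElementNS(SVG_NS, "font-face-format");
        fontFaceFormat.setAttributeNS(null, "string", "svg");
        fontFaceUri.appendChild(fontFaceFormat);

        Element fontFaceSrc = doc.createElementNS(SVG_NS, "font-face-src");
        fontFaceSrc.appendChild(fontFaceUri);
        fontFace.appendChild(fontFaceSrc);

        Element defs = doc.createElementNS(SVG_NS, "defs");
        defs.appendChild(fontFace);
        svgRoot.appendChild(defs);

        // A text element that uses the font family declared above.
        Element txt = doc.createElementNS(SVG_NS, "text");
        txt.setAttributeNS(null, "style", "font-family:" + family + ";font-size:24");
        txt.setTextContent(text);
        svgRoot.appendChild(txt);
        return doc;
    }
}
```

Calling SvgFontFaceSketch.build("DroidSansRegular", "fonts/DroidSans-webfont.svg#DroidSansRegular", "some text") gives a document with the same defs structure as in the answer.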

Just add the following element as a child of <svg/>: <style type="text/css"> Then, similar to HTML...

PDFBox change page sizes and save it again

First of all, sorry for my bad English.
I'm trying to remove the header and footer of a PDF page. I need to search for some words across the page break, but that is impossible with the header and footer in the way, so I need to crop them out and then convert the page to text; after that it becomes possible to search for the words.
This is what I'm doing:
PDDocument pdDoc = PDDocument.load("document.pdf");
PDPage page = (PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0);
PDRectangle rectangle = new PDRectangle();
rectangle.setUpperRightY(page.findCropBox().getUpperRightY() - 100);
rectangle.setLowerLeftY(page.findCropBox().getLowerLeftY());
rectangle.setUpperRightX(page.findCropBox().getUpperRightX());
rectangle.setLowerLeftX(page.findCropBox().getLowerLeftX());
page.setMediaBox(rectangle);
PDDocument document = new PDDocument();
document.addPage(page);
document.save("newDocument.pdf");
document.close();
But when I convert it to HTML, it still keeps the text that was hidden. Is there any way to save it without the cropped area and convert it to HTML correctly?
Thanks.
Best regards.
