Convert HTML to PDF with Special Characters using Java

I am using Flying Saucer with iText 2.1.7 to convert HTML to PDF. It works fine, except when the HTML contains Chinese, Korean, or similar characters.
The PDF then shows unexpected characters instead of the Chinese ones.
I found an open issue about this, so I assume there is currently no way to make Flying Saucer render such a PDF correctly?
PS: I also found another issue, but I can't understand the solution given there.
This is the code that I am using:
String doc = file.toURI().toURL().toString();
ITextRenderer renderer = new ITextRenderer();
renderer.getFontResolver().addFont (
"C:\\ARIALUNI.TTF",
BaseFont.IDENTITY_H,
BaseFont.EMBEDDED
);
renderer.setDocument(doc);
String outputFile = "report.pdf";
OutputStream os = new FileOutputStream(outputFile);
renderer.layout();
renderer.createPDF(os);
os.flush();
os.close();
Here, file is the HTML file that I am trying to convert.
Is there some other way or library to do the same?
This is the CSS that I am using:
@font-face {
font-family: "Arial";
src: url("media/arialuni.ttf");
-fs-pdf-font-embed: embed;
-fs-pdf-font-encoding: Identity-H;
}
The HTML file that I need to convert
These are the recompiled Flying Saucer JARs compatible with iText 2.1.x

Your font is probably not embedded in the PDF file. (See: How do I know if the fonts in a PDF file are embedded or not?)
Every font has a name; ARIALUNI.TTF defines Arial Unicode MS, so you should use that name.
So change this:
@font-face {
font-family: Arial1;
src: url("arialuni.ttf");
-fs-pdf-font-embed: embed;
-fs-pdf-font-encoding: Identity-H;
}
* {
font-family: Arial1;
}
To this:
@font-face {
font-family: Arial Unicode MS;
src: url("arialuni.ttf");
-fs-pdf-font-embed: embed;
-fs-pdf-font-encoding: Identity-H;
}
* {
font-family: Arial Unicode MS;
}
This way the font will be embedded.
You also don't need to call renderer.getFontResolver().addFont; the CSS is enough.

Try this:
font.addFont(Html2Pdfs.class.getResource("SIMSUN.TTC").toString().substring(6),BaseFont.IDENTITY_H,BaseFont.NOT_EMBEDDED)
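The substring(6) above strips the leading "file:/" from the resource URL, which only yields a usable path on some platforms and leaves percent-escapes such as %20 in place. A minimal sketch of a more portable conversion, assuming the font is a plain file on disk (not packed inside a JAR); the class and method names are made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;

// Hypothetical helper, not part of Flying Saucer or iText.
final class FontPaths {

    // Converts a file: URI into a platform-specific path, decoding escapes like %20.
    static String toFilePath(URI uri) {
        return Paths.get(uri).toString();
    }

    // Convenience overload for the URL returned by Class.getResource(...).
    static String toFilePath(URL url) {
        try {
            return toFilePath(url.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + url, e);
        }
    }
}
```

The result could then be passed as the first argument to addFont in place of the substring(6) expression.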

Related

Unicode emojis with colour

I'm trying to build an instant messaging application using JavaFX. I want to implement emoji support using Unicode; however, with a font such as Google's Noto Color Emoji, the emojis only appear in greyscale.
I've applied the font as follows and can confirm that it is loading correctly.
@font-face {
font-family: 'Noto Color Emoji';
src: url('/fonts/NotoColorEmoji.ttf');
}
.message-widget-font {
-fx-font-family: Noto Color Emoji;
}
Applying the CSS to the Text object:
messageText.getStyleClass().add("message-widget-font");
Is there a way to get the emojis to display in colour?

Add HTML Markup using java Apache PDFBOX

I have been using PDFBox and EasyTable, which extends PDFBox to draw data tables. I have hit a problem: I have a Java object holding a string of HTML that I need to add to the PDF using PDFBox. Digging through the documentation has not turned up anything.
The code below is a hello-world snippet; I want the text to appear with H1 formatting in the generated PDF.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "<h1>HelloWorld</h1>" );
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();

Use Jericho to render the HTML as plain text while mapping the tags to the correct output.
Sample:
public String extractAllText(String htmlText){
return new net.htmlparser.jericho
.Source(htmlText)
.getRenderer()
.setMaxLineLength(Integer.MAX_VALUE)
.setNewLine(null)
.toString();
}
Add the dependency to your Gradle build (or the Maven equivalent):
compile group: 'net.htmlparser.jericho', name: 'jericho-html', version: '3.4'

PDFBox does not know HTML, at least not for creating content.
Thus, with plain PDFBox you have to parse the HTML yourself and derive the text drawing characteristics from the tags the text is in.
E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the information that it is in an h1 tag to select an appropriate top-level header font and font size for drawing that "HelloWorld".
Alternatively you can look for a library doing that HTML parsing and transforming to PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
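To illustrate the do-it-yourself route, here is a rough sketch that splits a very simple HTML fragment into text runs with a font size per run. It uses only the JDK; the class name, the Run type, and the size table are assumptions for illustration, and a regex like this is only adequate for flat fragments such as the one in the question, not for general HTML:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns flat HTML into styled text runs.
final class HeadingRuns {

    // A piece of text plus the style it should be drawn with.
    static final class Run {
        final String text;
        final float fontSize;
        final boolean bold;

        Run(String text, float fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    // Illustrative sizes for h1..h6; pick whatever suits the report.
    private static final float[] HEADING_SIZES = {24f, 18f, 14f, 12f, 10f, 8f};
    private static final float BODY_SIZE = 12f;

    // Alternatives: a heading element, any other tag (skipped), or plain text.
    private static final Pattern TOKEN = Pattern.compile(
            "<h([1-6])[^>]*>(.*?)</h\\1\\s*>|<[^>]*>|([^<]+)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<Run> parse(String html) {
        List<Run> runs = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                int level = Integer.parseInt(m.group(1));
                runs.add(new Run(m.group(2).trim(), HEADING_SIZES[level - 1], true));
            } else if (m.group(3) != null) {
                String text = m.group(3).trim();
                if (!text.isEmpty()) {
                    runs.add(new Run(text, BODY_SIZE, false));
                }
            }
            // Any other tag matched by the middle alternative is ignored.
        }
        return runs;
    }
}
```

Each run could then be drawn with the content stream calls from the question, choosing a bold font when run.bold is set and passing run.fontSize to setFont before drawing run.text.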

Is it possible to change the font of a parsed HTML in IText?

First let me introduce the background around this problem:
I am using CKEditor to implement some rich text fields in a project. The editor is included through JavaScript and handles the fields, producing HTML that the browser renders properly.
The challenge was to include tables generated by the editor in a PDF. I tried Jasper Reports, but it didn't work very well (the parsed HTML did not render the tables and some of the styles generated by CKEditor).
I then tested iText and it worked very well: I was able to parse the tables and almost all of the CKEditor styles with the following code:
CSSResolver cssResolver = new StyleAttrCSSResolver();
CssFile cssFile = XMLWorkerHelper.getCSS(new FileInputStream(new File(CKEDITOR_CSS_FILE)));
cssResolver.addCss(cssFile);
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
XMLWorker worker = new XMLWorker(css, true);
htmlParser = new XMLParser(worker);
// Parsing the HTML generated by the editor
htmlParser.parse(new FileInputStream(HTML));
It applies the CSS correctly during parsing and outputs the following table:
My question is: is it possible to change the font of the parsed HTML, for example to use a smaller font or a bold font?
I will have a lot of fields to include in the PDF, and each field will be in a different section that demands different formatting. I was not able to find any interface in XMLWorker that allows this customization.

I changed my approach and dropped the idea of incorporating the CKEditor CSS in my report.
Instead, I wrote my own CSS for the HTML generated by CKEditor. In the end I have this structure:
A custom CSS file for my report containing all the configuration and formatting:
* {
font-size: 13px;
line-height: 20px;
}
table {
border-collapse: collapse;
width: 100%;
}
th, td {
padding: 6px;
}
table, th, td {
border: 1px solid black;
}
th {
font-weight: bold;
}
The result is this table:

Using a primary and a fallback font in Flying Saucer PDF generator

I'm having trouble getting Flying Saucer to use a secondary font for the glyphs/characters that are not present in my main font.
The Java code I'm using for this purpose is more or less:
String result = getPrintHtmlContent(urlString);
result = CharacterConverter.replaceInvalidCharacters(result);
ITextRenderer renderer = new ITextRenderer();
renderer.getFontResolver();
renderer.getFontResolver().addFont(FONTS_DIR_PATH + "ARIALUNI.TTF", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
renderer.getFontResolver().addFont(FONTS_DIR_PATH + "droidsans/DroidSans.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
renderer.getFontResolver().addFont(FONTS_DIR_PATH + "droidsans/DroidSansBold.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
renderer.setDocumentFromString(result, "http://" + frontendHost + ":" + frontendPort + frontendContextRoot);
renderer.layout();
renderer.createPDF(os);
And the css:
body {
font-family: "Droid Sans", "Arial Unicode MS";
}
I have also included the fonts in the CSS by using the @font-face rule.
I am able to get this to work using either of the fonts separately, so there seems to be no problem with Flying Saucer finding the fonts or with the CSS being applied.
If, on the other hand, I do as above and try to use both fonts, the output PDF uses only Droid Sans...
Is it even possible to use a "fallback font" in Flying Saucer, as it is on websites?

I asked the same question on Flying Saucer developer community and got a reply:
https://groups.google.com/forum/#!topic/flying-saucer-dev/5p00ISwnxiw
In short the answer is NO, it is not possible to use a secondary font.
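Since per-glyph fallback is not available, one possible workaround is to mark up the affected text yourself before rendering, so that CSS can assign the second font explicitly (e.g. a rule like .fallback { font-family: "Arial Unicode MS"; }). The sketch below is only an illustration and has not been verified against Flying Saucer: the class name, the CSS class, and the code point threshold are assumptions, and the input is expected to be plain, already escaped text rather than markup.

```java
// Hypothetical pre-processing step; not part of Flying Saucer.
final class FallbackMarkup {

    // Assumption: everything above Latin Extended-B is missing from the primary font.
    // A real check should be based on the glyphs the primary font actually contains.
    private static final int LAST_PRIMARY_CODE_POINT = 0x024F;

    // Wraps each run of "fallback" characters in <span class="cssClass">...</span>.
    static String wrapNonLatin(String text, String cssClass) {
        StringBuilder out = new StringBuilder(text.length() + 32);
        boolean inFallback = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean needsFallback = cp > LAST_PRIMARY_CODE_POINT;
            if (needsFallback && !inFallback) {
                out.append("<span class=\"").append(cssClass).append("\">");
            } else if (!needsFallback && inFallback) {
                out.append("</span>");
            }
            inFallback = needsFallback;
            out.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (inFallback) {
            out.append("</span>");
        }
        return out.toString();
    }
}
```

The body text keeps font-family: "Droid Sans", while the wrapped runs pick up whatever font the .fallback rule names.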

Converting HTML files into PDF

I am using the following code to generate a PDF file of the HTML Report
String url = new File("Test.html").toURI().toURL().toString();
OutputStream os = new FileOutputStream("Test.pdf");
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(url);
renderer.layout();
renderer.createPDF(os);
os.close();
I was able to use it to convert sample HTML files to PDF. But in my real usage, the HTML content contains various special symbols, such as &, <, and >, that the XML parser cannot handle.
I tried wrapping the content in CDATA while generating the HTML, but then found that the text inside the CDATA section is not visible in the HTML.
Does anyone have a solution for this?

Have you tried printing to PDF from the browser? Search for PrimoPDF, a program that will let you do it.

I don't know if this will help you, but you can use StringEscapeUtils from Apache Commons. It has methods for escaping and unescaping HTML (you may use them to pre-process your HTML before PDF generation).
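For illustration, here is a dependency-free stand-in for that kind of escaping. It is a minimal sketch with a made-up class name, meant to be applied to the text values while the HTML is being generated, not to the finished markup (which would escape the tags as well):

```java
// Hypothetical helper; a library such as Apache Commons offers the same kind of escaping.
final class XmlEscaper {

    // Replaces the characters that are special in XML with their entities.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, a table cell could be written as "<td>" + XmlEscaper.escapeXml(value) + "</td>", so that the resulting document stays well-formed for the renderer.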
