PDF Generation having multi-lingual text using Flying Saucer - java

I am trying to print Arabic and English text in a PDF using the Flying Saucer library. Here's my code:
String inputFile = "D:/test.xhtml";
String url = new File(inputFile).toURI().toURL().toString();
String outputFile = "D:/doc.pdf";
OutputStream os = new FileOutputStream(outputFile);
ITextRenderer renderer = new ITextRenderer();
ITextFontResolver resolver = renderer.getFontResolver();
resolver.addFont("D:/arialuni.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
renderer.setDocument(url);
renderer.layout();
renderer.createPDF(os);
os.close();
My XHTML file has the following data enclosed in paragraph tags:
اب اب اب اب Hello
The generated output displays only the English characters, not the Arabic glyphs. Please help.

For some reason, if no specific font is used, the generated PDF falls back to a default font (probably Helvetica) with a very limited character set, which does not contain the Greek code page.
Reference
Arial is a pretty standard font, installed by default on most operating systems, and it implements a wide variety of alphabets (including Greek).
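Registering the font with addFont may not be enough on its own: as far as I know, Flying Saucer only uses a registered font when the document's CSS names its family. Below is a minimal sketch that writes such an XHTML file. It assumes the family name stored in arialuni.ttf is "Arial Unicode MS" (check this for your copy of the font); the class name, method name and output path are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class XhtmlWithFont {
    // Assumption: the family name embedded in arialuni.ttf.
    static final String FONT_FAMILY = "Arial Unicode MS";

    // Builds a small XHTML document whose body text is styled with FONT_FAMILY.
    // bodyText is assumed to be already XML-escaped.
    static String buildXhtml(String bodyText) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "<head><style type=\"text/css\">\n"
            + "body { font-family: \"" + FONT_FAMILY + "\"; }\n"
            + "</style></head>\n"
            + "<body><p>" + bodyText + "</p></body>\n"
            + "</html>\n";
    }

    public static void main(String[] args) throws IOException {
        // Arabic letters written as Unicode escapes, followed by English text.
        String xhtml = buildXhtml("\u0627\u0628 \u0627\u0628 Hello");
        Path out = Paths.get("test.xhtml"); // hypothetical output path
        // Write the file as UTF-8 so it matches the encoding in the XML declaration.
        Files.write(out, xhtml.getBytes(StandardCharsets.UTF_8));
    }
}
```

The resulting file can then be passed to renderer.setDocument(url) as in the question.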

Related

PdfBox write hindi characters in pdf file

I tried many things to write Hindi characters using Apache PDFBox, but this seems to be an existing issue in the library.
I tried many of the available font files. Can someone help me out with this?
I tried the following:
PDDocument doc = new PDDocument();
PDPage page = new PDPage();
doc.addPage(page);
PDFont font = PDTrueTypeFont.loadTTF( doc, new FileInputStream(new File("D:\\Data\\fonts\\dn.ttf")));
font.setFontEncoding(new WinAnsiEncoding());
PDPageContentStream content = new PDPageContentStream( doc, page, true, false );
content.setFont(font, 10);
content.beginText();
content.moveTextPositionByAmount( 200, 100 );
content.drawString( "हिंदी" ); // Writing word "Hindi" in hindi language.
content.endText();
content.close();
doc.save( new FileOutputStream(new File("D:\\testOutput1.pdf")));
doc.close();
It's working for me in PDFBox.
The trick here is to use a non-Unicode string instead of a Unicode string.
Use the Kruti Dev font from the link below.
Then convert your Unicode string to a non-Unicode string.
Finally, use the converted string in your code.
That means replacing this line:
content.drawString( "हिंदी" ); // Writing word "Hindi" in hindi language.
with this line:
content.drawString( "fganh" ); // Writing word "Hindi" in hindi language.
Convert Unicode (Mangal) To Kruti Dev Font
I think this cannot be done using PDFBox, as there are a lot of issues with it.
I tried many fonts and PDFBox encoding types but failed to write Hindi text.
In the end I tried pdfmaker() in Node.js Express, which converts HTML to PDF. I had issues on my Linux server at first, but after I installed the appropriate TTF font it worked!

How to set character encoding for PDFBox

I'm building a PDF parser using Apache PDFBox. After parsing the plain text I run some algorithms and in the end output a JSON file. For some PDF files the output file contains UTF-8 encoding; for other PDFs it contains what seems to be Latin-1 encoding (spaces show up as "\xa0" when the JSON file is opened in Python). I assume this must be a consequence of the fonts or some other characteristic of the PDF?
My code to read the plain text is as follows:
PDDocument document = PDDocument.load(file);
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String text = pdfStripper.getText(document);
//Closing the document
document.close();
I've tried just saving the plain text:
PrintWriter out = new PrintWriter(outPath + ".txt");
out.print(text);
Even this plain text file shows "\xa0" characters instead of spaces when it is opened in Python and read into a dictionary, yielding the following result:
dict_keys(['1.\xa0\lorem\xa0ipsum', '2.\xa0\lorem\xa0ipsum\xa0\lorem\xa0ipsum', '3.\xa0\lorem', '4.\xa0\lorem\xa0ipsum', '5.\xa0\lorem\xa0ipsum'])
I'd like to make sure the text always gets encoded as utf-8. How do I go about doing this?
If you want to make sure your PrintWriter uses UTF-8 encoding, say so in the constructor:
PrintWriter out = new PrintWriter(outPath + ".txt", "UTF-8");
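One related point worth checking: "\xa0" in a Python string repr is U+00A0 (NO-BREAK SPACE), a legitimate character that survives UTF-8 encoding unchanged, so fixing the writer's charset alone may not remove it. Below is a minimal sketch that writes UTF-8 and, assuming you want ordinary spaces in the output, replaces the no-break spaces first. The class and method names are hypothetical, and the sample string stands in for the PDFTextStripper output.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8TextWriter {
    // Replaces every NO-BREAK SPACE (U+00A0) with an ordinary space.
    static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ');
    }

    // Writes the text with an explicit UTF-8 charset instead of the platform default.
    static void writeUtf8(Path target, String text) throws IOException {
        try (PrintWriter out = new PrintWriter(target.toFile(), "UTF-8")) {
            out.print(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the extracted text; it contains two no-break spaces.
        String extracted = "1.\u00A0lorem\u00A0ipsum";
        Path target = Files.createTempFile("extracted", ".txt");
        writeUtf8(target, normalizeSpaces(extracted));

        // Read the file back as UTF-8 to confirm the round trip.
        String readBack = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(readBack); // prints "1. lorem ipsum"
    }
}
```

On the Python side, the file should then be opened with encoding="utf-8" so both ends agree on the charset.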

Error trying to show Spanish or French characters with PDFBOX

I'm trying to create a PDF with PDFBOX-2.0.0-SNAPSHOT, but I'm running into errors.
This is the typical Hello World example with Spanish and French characters:
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
PDType1Font font = PDType1Font.HELVETICA;
PDPageContentStream stream = new PDPageContentStream(document, page);
String text = "áÁÀà";
stream.beginText();
stream.setFont(font, 12);
stream.newLineAtOffset(100, 700);
stream.showText(text);
stream.endText();
stream.close();
document.save("sample.pdf");
document.close();
And I get this error:
sep 02, 2015 12:42:43 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
ADVERTENCIA: Using fallback font ArialMT for base font ZapfDingbats
Exception in thread "main" java.lang.IllegalArgumentException: This font type only supports 8-bit code points
If I load the arialuni.ttf font it compiles, but I only get question marks in the PDF file.
I have tried PDFBox 1.8 and it doesn't work either.
Any ideas?
Thanks in advance.
UPDATE:
After some testing I realized that if you change the encoding of the project (at least in IntelliJ IDEA) and don't retype the problematic characters in the code, the new encoding doesn't take effect.
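The source-encoding problem described in the update can be sidestepped by writing the accented characters as Unicode escapes, which the compiler reads the same way whatever encoding the file is saved in. The sketch below is only a diagnostic aid with hypothetical class and method names: it prints the code points of the string, and simulates what happens when UTF-8 bytes are decoded as ISO-8859-1 (used here as a stand-in for any single-byte encoding mismatch).

```java
import java.nio.charset.StandardCharsets;

public class SourceEncodingCheck {
    // Same text as in the question, written with Unicode escapes so it does
    // not depend on the encoding of the source file.
    static final String TEXT = "\u00E1\u00C1\u00C0\u00E0";

    // Lists the code points of a string, e.g. "U+00E1 U+00C1".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Simulates an encoding mismatch: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecoded(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Four code points, one per accented letter.
        System.out.println(codePoints(TEXT));
        // Eight code points: each letter has turned into two wrong characters.
        System.out.println(codePoints(misdecoded(TEXT)));
    }
}
```

If the string you pass to showText prints more code points than it has visible letters, the source file was most likely compiled with the wrong encoding.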
The PDType1Font.XXX constants are fonts provided by the PDF viewer itself, and they don't support Unicode. You should be able to use a TTF font instead, as in this example: https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/EmbeddedFonts.java
PDType0Font font = PDType0Font.load(document, new File("path/YourFont.ttf"));

Securing PDF Generated from iTextPdf

I have made software that generates a PDF as part of its function, using the iTextPDF Java library. For a demo version of my software, I added a text watermark (like "demo software") with the following code:
PdfContentByte under = writer.getDirectContentUnder();
BaseFont baseFont = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.EMBEDDED);
under.beginText();
under.setColorFill(BaseColor.RED);
under.setFontAndSize(baseFont, 25);
under.showTextAligned(PdfContentByte.ALIGN_CENTER," demo software",250, 470,55);
under.endText();
After that I converted it to .docx format using a PDF-to-Word converter. The resulting .docx file does not contain the watermark, and the contents are easily editable, so the whole purpose of providing demo software is defeated.
How can I achieve permanent watermarking, so that a PDF-to-Word converter won't be able to remove it?
One idea that comes to mind is that, instead of putting the text in the PDF, there should be a way of first converting all the text of a page into an image and then building the PDF from those images. But I am unsure how to achieve this using iTextPdf.
You can encrypt your PDF so that it cannot be modified without an owner password. After you have generated your PDF, create a PdfStamper with your PDF as input
and encrypt the PDF like this:
final PdfReader reader = new PdfReader(your_input_stream);
final PdfStamper stamper = new PdfStamper(reader, your_output_stream);
stamper.setEncryption(PdfWriter.ENCRYPTION_AES_128 | PdfWriter.DO_NOT_ENCRYPT_METADATA,
"your_user_password", "your_owner_password", PdfWriter.ALLOW_PRINTING);
stamper.close();
As a side note, I would recommend not using a hardcoded owner password. Since you have no need for the owner password after the file has been generated, I would suggest making it a SHA hash of a random string of, say, 20 alphanumeric characters.
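A minimal sketch of that suggestion, using only the JDK: it draws 20 random alphanumeric characters and hashes them with SHA-256. The class and method names are hypothetical, and SHA-256 is an assumption, since the answer only says "a SHA hash".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

public class OwnerPasswordGenerator {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns a random alphanumeric string of the given length.
    static String randomAlphanumeric(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Returns the SHA-256 digest of the input as 64 lowercase hex characters.
    static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Produces a throwaway owner password that is never stored anywhere.
    static String newOwnerPassword() {
        return sha256Hex(randomAlphanumeric(20));
    }

    public static void main(String[] args) {
        System.out.println(newOwnerPassword().length()); // prints 64
    }
}
```

The result of newOwnerPassword() would take the place of "your_owner_password" in the setEncryption call.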

How to write Chinese characters to a PDF file

I am trying to read text from a properties file that is encoded in UTF-8 and write it into a PDF file using the Document object in Java.
Document document = new Document();
File file = new File(FILES_PATH + ".pdf");
FileOutputStream fos = new FileOutputStream(file);
PdfWriter.getInstance(document, fos);
.
.
.
pdfTable table;
document.add(table);
document.close();
When I use the value from the properties file as-is, the Chinese characters are ignored.
When I try to re-encode the string, I get strange words or "?" instead of the Chinese characters.
I tried encoding it as UTF-8, ISO-8859-1, GBK and GB2312.
I need help getting Chinese characters into the PDF file.
It will not work that way.
To display Unicode characters that are not covered by the built-in PDF fonts, you need to specify a custom font for the text and create a separate fragment for each piece of text that is covered by a given font. You also need to embed the fonts you use into the PDF document (so please check whether the license of those fonts permits distributing them).
So a single String may have to be rendered using several fonts. iText has the class FontSelector, which does that job:
FontSelector selector = new FontSelector();
BaseFont bf1 = BaseFont.createFont(fontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
bf1.setSubset(true);
Font font1 = new Font(bf1, 12, Font.BOLD);
selector.addFont(font1);
// ... do that with all fonts you need
Phrase ph = selector.process(TEXT);
document.add(new Paragraph(ph));
You can find a more complex example in my article: Using dynamic fonts for international texts in iText
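Separately from font selection, the properties file itself can lose the Chinese characters before they ever reach iText: java.util.Properties.load(InputStream) decodes the file as ISO-8859-1, so a UTF-8 properties file has to be read through a Reader with an explicit charset. Below is a minimal sketch using only the JDK; the class name, method name and the "greeting" key are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class Utf8PropertiesLoader {
    // Parses properties data that is encoded as UTF-8.
    static Properties parseUtf8(byte[] data) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            // load(Reader) uses the reader's charset; load(InputStream) would
            // assume ISO-8859-1 and garble multi-byte characters.
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a UTF-8 properties file with a Chinese value.
        byte[] data = "greeting=\u4F60\u597D\n".getBytes(StandardCharsets.UTF_8);
        Properties props = parseUtf8(data);
        // Two Chinese characters survive as two Java chars.
        System.out.println(props.getProperty("greeting").length()); // prints 2
    }
}
```

For a real file, the bytes could come from Files.readAllBytes on the properties path; the resulting strings can then be handed to FontSelector.process as shown above.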
