Create a Unicode PDF file containing Bangla letters using iText in Java

I have to generate a PDF file using iText in the NetBeans IDE. The PDF may contain Bangla letters. I can already generate a PDF with Bangla letters, but the letters are not rendered in their correct form.
Suppose I have to show: বরিশাল -- but the PDF shows: [1]: http://i.stack.imgur.com/abwOV.jpg
Suppose I have to show: পড়ি -- but the PDF shows: পড় ি
My code to generate this file:
Document document = new Document();
BaseFont unicode = BaseFont.createFont("c:/windows/fonts/NikoshBan.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font font = new Font(unicode);
PdfWriter writer=PdfWriter.getInstance(document, new FileOutputStream("TableDat.pdf"));
document.open();
document.add(new Paragraph("বরিশাল",font));
document.close();
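
One way to inspect the problem is to list the code points of the string in the order they are stored. In Bangla, the vowel sign ি is stored after the consonant it belongs to but is displayed to its left, so a renderer that draws glyphs in stored order without complex-script shaping would misplace it. Whether that is what happens in the PDF above is an assumption; the sketch below only shows the stored order. The class name BanglaCodePoints and the method describe are made up for this illustration, and the code uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText: shows the stored (logical) order of a string.
final class BanglaCodePoints {

    /** Returns one entry per code point, e.g. "U+09AC BENGALI LETTER BA". */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
                out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }
}
```

Calling BanglaCodePoints.describe("\u09AC\u09B0\u09BF\u09B6\u09BE\u09B2") (the word বরিশাল written with escapes) returns six entries, and the third one is U+09BF, the vowel sign that has to be drawn before the second letter.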

Related

Converting ePDF to a non searchable pdf programmatically

I need to OCR a readable (searchable) PDF again to generate an hOCR file.
I am using:
Apache PDFBox -> to parse the readable PDF
Qoppa -> to OCR the PDF
Currently, I am trying to use iText to create a PDF after converting a PDF page to an image with PDFRenderer from PDFBox.
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream("iTextImageExample.pdf"));
document.open();
Image img = Image.getInstance("C:\\Projects\\Data Mapping\\singularity-data-jockey-pdflib\\src\\test\\resources\\data\\pageimg.jpg");
document.add(img);
I also tried the Qoppa PDFDocument class.
The issue is that the output PDF generated from the image is low quality, so the OCR is not accurate. Is there another way to convert the PDF to a non-searchable PDF?

HTML to PDF using flying saucer: bold text is blurry

I am using Flying Saucer (xhtmlrenderer) 9.1.22 to create PDF files from HTML content,
but the bold text in the PDF looks blurry.
This is my code snippet:
ITextRenderer renderer = new ITextRenderer();
renderer.getFontResolver().addFont("arial.ttf", BaseFont.CP1252, BaseFont.EMBEDDED);
renderer.setDocumentFromString(htmlAsString);
renderer.layout();
renderer.createPDF(output);

Add HTML Markup using java Apache PDFBOX

I have been using PDFBox and EasyTable, which extends PDFBox, to draw data tables. I have hit a problem: I have a Java object holding a string of HTML that I need to add to the PDF using PDFBox. Digging through the documentation has not borne fruit.
The code below is a hello-world snippet; I want the text in the generated PDF to have H1 formatting.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "<h1>HelloWorld</h1>" );
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
}

Use Jericho to convert the HTML to plain text while rendering the output of the tags correctly.
Sample:
public String extractAllText(String htmlText){
return new net.htmlparser.jericho
.Source(htmlText)
.getRenderer()
.setMaxLineLength(Integer.MAX_VALUE)
.setNewLine(null)
.toString();
}
Include it in your Gradle or Maven build:
compile group: 'net.htmlparser.jericho', name: 'jericho-html', version: '3.4'

PDFBox does not know HTML, at least not for creating content.
Thus, with plain PDFBox you have to parse the HTML yourself and derive the text drawing characteristics from the tags the text is in.
E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the information that it is in an h1 tag to select an appropriate top-level heading font and font size to draw that "HelloWorld".
Alternatively, you can look for a library that does the HTML parsing and transforms it into PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
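
A minimal sketch of the "parse it yourself" half of this approach, using only the Java standard library. It handles a single, non-nested heading element and nothing else; real HTML needs a proper parser. The class name HeadingStyleSketch, the StyledText holder and the point sizes are assumptions made up for illustration and are not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts heading text and picks a font size for it.
final class HeadingStyleSketch {

    // Matches exactly one element such as <h1>Text</h1>; nested tags are not handled.
    private static final Pattern HEADING = Pattern.compile(
            "<h([1-6])>(.*?)</h\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumed point sizes for h1..h6; choose whatever suits the document.
    private static final float[] SIZES = {24f, 18f, 14f, 12f, 10f, 8f};

    /** Plain text plus the drawing characteristics derived from its tag. */
    static final class StyledText {
        final String text;
        final float fontSize;
        final boolean bold;

        StyledText(String text, float fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    /** Returns heading text with heading styling, or the input unchanged at 12pt. */
    static StyledText parse(String html) {
        Matcher m = HEADING.matcher(html.trim());
        if (m.matches()) {
            int level = Integer.parseInt(m.group(1));
            return new StyledText(m.group(2), SIZES[level - 1], true);
        }
        return new StyledText(html, 12f, false);
    }
}
```

With the question's snippet, the result of HeadingStyleSketch.parse("<h1>HelloWorld</h1>") would supply the size passed to contentStream.setFont and the text passed to the draw call, so the PDF shows "HelloWorld" instead of the raw tags.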

Setting substitution fonts in AcroForm itext7

I have a PDF with an AcroForm and need to fill it with a string that contains glyphs from different languages (English, Chinese, Korean, Khmer).
In iText5 I used:
AcroFields form = stamper.getAcroFields();
form.addSubstitutionFont(arialFont);
form.addSubstitutionFont(khmerFont);
It worked fine for Chinese and Korean, but I ran into an issue with Khmer ligatures not being rendered. I found out that I need the pdfCalligraph add-on to make ligatures work, but it comes with iText7 only. I've managed to add paragraphs with proper Khmer ligature rendering (this requires typography as a dependency and loading a license key). But in an AcroForm it does not happen automatically. I'm struggling to find an iText7 equivalent of addSubstitutionFont and to make it work with pdfCalligraph.
Code I've used with iText7:
PdfReader reader = new PdfReader(templatePath);
PdfDocument pdf = new PdfDocument(reader, new PdfWriter(outputPath));
Document document = new Document(pdf);
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
PdfFont khmerFont = PdfFontFactory.createFont(pathToKhmerFont, PdfEncodings.IDENTITY_H, true);
PdfFont font = PdfFontFactory.createFont(pathToArialUnicodeFont, PdfEncodings.IDENTITY_H, true);
pdf.addFont(khmerFont);
pdf.addFont(font);
FontSet set = new FontSet();
set.addFont(pathToKhmerFont);
set.addFont(pathToArialUnicodeFont);
document.setFontProvider(new FontProvider(set));
document.setProperty(Property.FONT, "Arial");
form.setNeedAppearances(true);
String content = "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일";
PdfFormField tf = form.getField("Text3");
tf.setValue(content);
// tf.setFont(khmerFont);
tf.regenerateField();
// add a paragraph just to check pdfCalligraph works
document.add(new Paragraph(content));
pdf.close();
String used to test proper rendering: "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일"
iText5, form field without pdfCalligraph but with substitution fonts:
iText7, form field with pdfCalligraph loaded (Arial font set with field.setFont(arialFont)):
iText7, form field with pdfCalligraph loaded (Khmer font set with field.setFont(khmerFont)):
iText7, same document but in a paragraph instead of a form field, with pdfCalligraph loaded (this is the expected result, so pdfCalligraph is used for paragraphs but not for form fields):
So, as you can see, there are basically 2 issues:
1. How do I addSubstitutionFont in iText7?
2. How do I use pdfCalligraph in a PdfFormField appearance?
I've also checked whether pdfCalligraph works in a text form field, and it looks like it does not. Here is the code I used to check:
LicenseKey.loadLicenseFile(path_to_license);
String outputPath = path_to_output_doc;
PdfDocument pdf = new PdfDocument(new PdfWriter(outputPath));
Document document = new Document(pdf);
// prepare fonts for pdfCalligraph to use
FontSet set = new FontSet();
set.addFont("/path_to/Khmer.ttf");
set.addFont("/path_to/ArialUnicodeMS.ttf");
FontProvider fontProvider = new FontProvider(set);
document.setFontProvider(fontProvider);
document.setProperty(Property.FONT, "Arial");
String content = "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일";
// Add a paragraph to check if pdfCalligraph works
document.add(new Paragraph(content));
// Add a form text field
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
PdfTextFormField field = PdfFormField.createText(pdf, new Rectangle(36, 700, 400, 30), "test");
field.setValue(content);
form.addField(field);
document.close();
Output with the pdfCalligraph dependency loaded (as you can see, the paragraph is rendered properly, but in the form field all non-Helvetica characters are simply ignored):
Output without the pdfCalligraph dependency loaded (as you can see, the paragraph is not rendered properly, which is expected; the form field looks the same as with pdfCalligraph loaded):
Am I missing something?
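
To make the test string easier to reason about, here is a small standard-library sketch that splits a string into runs by Unicode script, so you can see which stretches need a font covering Khmer, Han or Hangul. This is not iText API and it is not how addSubstitutionFont decides anything; grouping by script is a swapped-in illustration. The class name ScriptRunsSketch and the method splitByScript are made up.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, unrelated to iText: groups text into runs of one script.
final class ScriptRunsSketch {

    /** Splits text where the script changes; spaces and marks stay with the current run. */
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeScript currentScript = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // COMMON (spaces, punctuation) and INHERITED (some marks) never start a run.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral && currentScript != null && script != currentScript) {
                runs.add(current.toString());
                current.setLength(0);
            }
            if (!neutral) {
                currentScript = script;
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

For example, splitByScript("ab \u5047\u65E5 \uD734\uC77C") yields three runs: the Latin letters with their trailing space, the two Han characters with their trailing space, and the two Hangul syllables.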

PDFTable Itext arabic

I have written Java code and I want Arabic words to be displayed in a PdfPTable that is added to an iText document to create a PDF.
As the attached picture shows, the Arabic text comes out as "???".
PdfPTable header = new PdfPTable(6);
PdfPTable tbame = new PdfPTable(1);
tbame.addCell(" >>>>>> " + install.getCustId().getFullName() + " <<<<<<");
tbame.setHorizontalAlignment(PdfPTable.ALIGN_CENTER);
tbame.setLockedWidth(false);
tbame.setExtendLastRow(false);
tbame.setWidthPercentage(100);
header.addCell("End");
header.addCell("Start");

Please read the documentation; you'll find that the addCell(String content) method cannot be used to add Arabic text, for two reasons:
1. When you use this method, the default font, Helvetica, is used. You need a font that knows how to draw Arabic shapes. This is explained in the answer to this question: Itext Arabic Font coming as question marks
2. Arabic is written from right to left, which means you need to change the run direction of the cell's content, as explained in my answer to this question: RTL not working in pdf generation with itext 5.5 for Arabic text
A code snippet:
BaseFont bf = BaseFont.createFont("c:/WINDOWS/Fonts/arialuni.ttf",
BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font font = new Font(bf, 12);
Phrase phrase = new Phrase(
"\u0644\u0648\u0631\u0627\u0646\u0633 \u0627\u0644\u0639\u0631\u0628", font);
PdfPCell cell = new PdfPCell(phrase);
cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
table.addCell(cell);
If you don't have access to the font arialuni.ttf, you'll have to find another font that contains Arabic glyphs.
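
If the table mixes Arabic and Latin cells, you may want to decide per cell whether the run direction has to be changed. A minimal sketch using java.text.Bidi from the standard library; the class name RunDirectionSketch and the method needsRtl are hypothetical and are not part of iText.

```java
import java.text.Bidi;

// Hypothetical helper: decides whether a cell's text contains right-to-left content.
final class RunDirectionSketch {

    /** True when the text contains characters that require bidirectional handling. */
    static boolean needsRtl(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }
}
```

For the Arabic phrase in the snippet above needsRtl returns true, and for "End" or "Start" it returns false, so the setRunDirection call can be made only for the cells that need it.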
