PDFBox not supporting multiple languages

I'm trying to generate a PDF report containing sentences in multiple languages. I'm using Google Noto fonts, but the Noto CJK fonts don't cover some of the Latin special characters. As a result, PDFBox either fails to generate the report or shows wrong characters.
Does anyone have a suitable solution? I tried several things but could not find a single TTF file that covers all of Unicode. I also tried falling back to different font files, but that would be too much work.
Languages I support: Japanese, German, Spanish, Portuguese, English.
Note: I don't want to use the arialuni.ttf file because of licensing issues.
Can anyone suggest anything?

Here is the code that will be in release 2.0.14 in the examples subproject:
/**
* Output a text without knowing which font is the right one. One use case is a worldwide
* address list. Only LTR languages are supported, RTL (e.g. Hebrew, Arabic) are not
* supported so they would appear in the wrong direction.
* Complex scripts (Thai, Arabic, some Indian languages) are also not supported, any output
* will look weird. There is an (unfinished) effort here:
* https://issues.apache.org/jira/browse/PDFBOX-4189
*
* @author Tilman Hausherr
*/
public class EmbeddedMultipleFonts
{
public static void main(String[] args) throws IOException
{
try (PDDocument document = new PDDocument())
{
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
PDFont font1 = PDType1Font.HELVETICA; // always have a simple font as first one
TrueTypeCollection ttc2 = new TrueTypeCollection(new File("c:/windows/fonts/batang.ttc"));
PDType0Font font2 = PDType0Font.load(document, ttc2.getFontByName("Batang"), true); // Korean
TrueTypeCollection ttc3 = new TrueTypeCollection(new File("c:/windows/fonts/mingliu.ttc"));
PDType0Font font3 = PDType0Font.load(document, ttc3.getFontByName("MingLiU"), true); // Chinese
PDType0Font font4 = PDType0Font.load(document, new File("c:/windows/fonts/mangal.ttf")); // Indian
PDType0Font font5 = PDType0Font.load(document, new File("c:/windows/fonts/ArialUni.ttf")); // Fallback
try (PDPageContentStream cs = new PDPageContentStream(document, page))
{
cs.beginText();
List<PDFont> fonts = new ArrayList<>();
fonts.add(font1);
fonts.add(font2);
fonts.add(font3);
fonts.add(font4);
fonts.add(font5);
cs.newLineAtOffset(20, 700);
showTextMultiple(cs, "abc 한국 中国 भारत 日本 abc", fonts, 20);
cs.endText();
}
document.save("example.pdf");
}
}
static void showTextMultiple(PDPageContentStream cs, String text, List<PDFont> fonts, float size)
throws IOException
{
try
{
// first try all at once
fonts.get(0).encode(text);
cs.setFont(fonts.get(0), size);
cs.showText(text);
return;
}
catch (IllegalArgumentException ex)
{
// do nothing
}
// now try separately
int i = 0;
while (i < text.length())
{
boolean found = false;
for (PDFont font : fonts)
{
try
{
String s = text.substring(i, i + 1);
font.encode(s);
// it works! Try more with this font
int j = i + 1;
for (; j < text.length(); ++j)
{
String s2 = text.substring(j, j + 1);
if (isWinAnsiEncoding(s2.codePointAt(0)) && font != fonts.get(0))
{
// Without this segment, the example would have a flaw:
// This code tries to keep the current font, so
// the second "abc" would appear in a different font
// than the first one, which would be weird.
// This segment assumes that the first font has WinAnsiEncoding.
// (all static PDType1Font Times / Helvetica / Courier fonts)
break;
}
try
{
font.encode(s2);
}
catch (IllegalArgumentException ex)
{
// it's over
break;
}
}
s = text.substring(i, j);
cs.setFont(font, size);
cs.showText(s);
i = j;
found = true;
break;
}
catch (IllegalArgumentException ex)
{
// didn't work, will try next font
}
}
if (!found)
{
throw new IllegalArgumentException("Could not show '" + text.substring(i, i + 1) +
"' with the fonts provided");
}
}
}
static boolean isWinAnsiEncoding(int unicode)
{
String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
if (".notdef".equals(name))
{
return false;
}
return WinAnsiEncoding.INSTANCE.contains(name);
}
}
Alternatives to arialuni can be found here:
https://en.wikipedia.org/wiki/Open-source_Unicode_typefaces
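The fallback idea above can also be tried without PDFBox on the classpath. The following is a minimal sketch, not the code from the PDFBox examples: it uses a simpler strategy (each code point goes to the first font in the list that supports it) and iterates by code point rather than by char, so surrogate pairs stay together. The class name FontRunSplitter and the use of an IntPredicate as a stand-in for "this font can encode this code point" are assumptions made for illustration. With PDFBox, such a predicate would presumably wrap font.encode(...) in a try/catch for IllegalArgumentException, as showTextMultiple does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class FontRunSplitter
{
    /** A run of text together with the index of the font chosen for it. */
    public static final class Run
    {
        public final int fontIndex;
        public final String text;

        Run(int fontIndex, String text)
        {
            this.fontIndex = fontIndex;
            this.text = text;
        }
    }

    /**
     * Splits text into runs. Each code point is assigned to the first font
     * (in list order) whose predicate accepts it; adjacent code points with
     * the same font are merged into one run.
     */
    public static List<Run> split(String text, List<IntPredicate> fonts)
    {
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentFont = -1;
        int i = 0;
        while (i < text.length())
        {
            int cp = text.codePointAt(i);
            int font = firstSupporting(cp, fonts);
            if (font < 0)
            {
                throw new IllegalArgumentException(
                        String.format("No font for U+%04X", cp));
            }
            if (font != currentFont && current.length() > 0)
            {
                // font changes: close the run collected so far
                runs.add(new Run(currentFont, current.toString()));
                current.setLength(0);
            }
            currentFont = font;
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0)
        {
            runs.add(new Run(currentFont, current.toString()));
        }
        return runs;
    }

    /** Returns the index of the first font accepting cp, or -1 if none does. */
    private static int firstSupporting(int cp, List<IntPredicate> fonts)
    {
        for (int k = 0; k < fonts.size(); k++)
        {
            if (fonts.get(k).test(cp))
            {
                return k;
            }
        }
        return -1;
    }
}
```

Each resulting run would then be written with one setFont/showText pair. Because the first font always wins where it can, Latin text after a CJK segment falls back to the first font again, which is the effect the isWinAnsiEncoding check achieves in the example above.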

Related

No glyph found after getting Text and Font from existing pdf

My goal is to transfer textual content from a PDF to a new PDF while preserving the formatting of the font. (e.g. Bold, Italic, underlined..).
I am trying to take the TextPosition list from the existing PDF and write a new PDF from it.
For this, I read the font and font size of each TextPosition entry and set them on a content stream, then write the text with contentStream.showText().
After 137 successful loop iterations, this error is thrown:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
TextPosition tpInfo = null;
String pdfFileInText = null;
int charIDindex = 0;
int pageIndex = 0;
try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
if (!pdfDocument.isEncrypted()) {
MyPdfTextStripper myStripper = new MyPdfTextStripper();
var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
createDirectory();
String newFileString = (srcErledigt + "Test.pdf");
File input = new File(newFileString);
input.createNewFile();
PDDocument document = new PDDocument();
// For Pages
for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
List<List<TextPosition>> pageList = pageIterator.next();
PDPage newPage = new PDPage();
document.addPage(newPage);
PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
contentStream.beginText();
pageIndex++;
// For Articles
for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
List<TextPosition> articleList = articleIterator.next();
// For Text
for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
tpCharID = charIDindex;
tpInfo = tpIterator.next();
System.out.println(tpCharID + ". charID: " + tpInfo);
PDFont tpFont = tpInfo.getFont();
float tpFontSize = tpInfo.getFontSize();
pdfFileInText = tpInfo.toString();
contentStream.setFont(tpFont, tpFontSize);
contentStream.newLineAtOffset(50, 700);
contentStream.showText(pdfFileInText);
charIDindex++;
}
}
contentStream.endText();
contentStream.close();
}
} else {
System.out.println("pdf Encrypted");
}
}
}
MyPdfTextStripper:
public class MyPdfTextStripper extends PDFTextStripper {
public MyPdfTextStripper() throws IOException {
super();
setSortByPosition(true);
}
@Override
public List<List<TextPosition>> getCharactersByArticle() {
return super.getCharactersByArticle();
}
// Add Pages to CharactersByArticle List
public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
final int maxPageNr = doc.getNumberOfPages();
List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
setStartPage(pageNr);
setEndPage(pageNr);
getText(doc);
byPageList.add(List.copyOf(getCharactersByArticle()));
}
return byPageList;
}
Additional Info:
There are seven fonts in my document, all of which are set as subsets.
I need to write the Text given with the corresponding Font given.
All glyphs that should be written already exist in the original document, where I get my TextPositionList from.
All fonts are subtype 1 or 0
There is no AcroForm defined
Thanks in advance
Edit 30.08.2022:
Fixed the Issue by manually replacing this particular Unicode with a placeholder for the String before trying to write it.
Now I ran into this open ToDo:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
@Override
public byte[] encode(int unicode)
{
// todo: we can use a known character collection CMap for a CIDFont
// and an Encoding for Type 1-equivalent
throw new UnsupportedOperationException();
}
Anyone got any suggestions or Workarounds for this?
Edit 01.09.2022
I tried to replace occurrences of that Font with an alternative Font from the source file, but this opens another problem where a COSStream is "randomly" closed, which results in the new document not being able to save the File after writing my text with a contentStream.
Using standard Fonts like PDType1Font.HELVETICA instead works though..
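The placeholder workaround from the 30.08.2022 edit could look roughly like the sketch below. It is only an illustration: the class name GlyphSanitizer and the IntPredicate parameter are hypothetical, and the predicate stands in for whatever check decides that the target font can encode a code point (for the error above, simply "is not U+00AD").

```java
import java.util.function.IntPredicate;

public class GlyphSanitizer
{
    /**
     * Returns a copy of text in which every code point rejected by
     * canEncode is replaced with the given placeholder string.
     */
    public static String replaceUnsupported(String text, IntPredicate canEncode,
                                            String placeholder)
    {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
        {
            if (canEncode.test(cp))
            {
                out.appendCodePoint(cp);
            }
            else
            {
                out.append(placeholder);
            }
        });
        return out.toString();
    }
}
```

For example, replaceUnsupported(pdfFileInText, cp -> cp != 0x00AD, "-") would swap each soft hyphen for a plain hyphen before the string is passed to showText().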

PDF font embedding not working using PDFBox

I am working on embedding fonts that are not embedded in a PDF. For this, I use the PDFBox library and its PDFont class to identify the missing fonts, and I am able to get the list of missing fonts. But when I try to embed them using fonts from my local machine (TTF files taken from my local fonts folder), it does not work, and I get the following result.
I am using the following code to get the list of fonts that are not embedded:
private static List<FontsDetails> checkAllFontsEmbeddedOrNot(PDDocument pdDocument) throws Exception {
List<FontsDetails> notEmbFonts = null;
try {
if(null != pdDocument){
PDPageTree pageTree = pdDocument.getDocumentCatalog().getPages();
notEmbFonts = new ArrayList<>();
for (PDPage pdPage : pageTree) {
PDResources resources = pdPage.getResources();
Iterable<COSName> cosNameIte = resources.getFontNames();
Iterator<COSName> names = cosNameIte.iterator();
while (names.hasNext()) {
COSName name = names.next();
PDFont font = resources.getFont(name);
boolean isEmbedded = font.isEmbedded();
if(!isEmbedded){
FontsDetails fontsDetails = new FontsDetails();
fontsDetails.setFontName(font.getName().toString());
fontsDetails.setFontSubType(font.getSubType());
notEmbFonts.add(fontsDetails);
}
}
}
}
} catch (Exception exception) {
logger.error("Exception occurred while validating fonts : ", exception);
throw new PDFUtilsException("Exception occurred while validating fonts : ",exception);
}
return notEmbFonts;
}
The following is the code I use to embed the fonts from the list above:
public List<FontsDetails> embedFontToPdf(File pdf, FontsDetails fontToEmbed) {
ArrayList<FontsDetails> notSupportedFonts = new ArrayList<>();
try (PDDocument pdDocument = PDDocument.load(pdf)) {
LOGGER.info("Embedding font : " + fontToEmbed.getFontName());
InputStream ttfFileStream = PDFBoxOperationsUtility.class.getClassLoader()
.getResourceAsStream(fontToEmbed.getFontName() + ".ttf");//loading ttf file
if (null != ttfFileStream) {
PDFont font = PDType0Font.load(pdDocument, ttfFileStream);
PDPage pdfPage = new PDPage();
PDResources pdfResources = new PDResources();
pdfPage.setResources(pdfResources);
PDPageContentStream contentStream = new PDPageContentStream(pdDocument, pdfPage);
if (fontToEmbed.getFontSize() == 0) {
fontToEmbed.setFontSize(DEFAULT_FONT_SIZE);
}
font.encode("ANSI");
contentStream.setFont(font, fontToEmbed.getFontSize());
contentStream.close();
pdDocument.addPage(pdfPage);
pdDocument.save(pdf);
} else {
LOGGER.info("Font : " + fontToEmbed.getFontName() + " not supported");
notSupportedFonts.add(fontToEmbed);
}
} catch (Exception exception) {
notSupportedFonts.add(fontToEmbed);
LOGGER.error("Error ocurred while embedding font to pdf : " + pdf.getName(), exception);
}
return notSupportedFonts;
}
Could someone please help me identify what mistake I am making, or what other approach I should take?

Remove space between images added into a single pdf file with iText using java.

I am trying to create PDF files from a list of images. Four images should cover a full page with no margin, padding, or anything else. My problem is that the added images are separated by a white line, and I can't figure out a way to remove this separation.
public ByteArrayOutputStream createMultiTicketPdf(List<String> base64Images) {
PDFCreator creator = new PDFCreator();
Document document = creator.getDocument();
creator.setForMulti(true);
float nomargin = 0;
creator.addCustomCSS("common", "/pdf/common.css");
document.setMargins(nomargin, nomargin, nomargin, nomargin);
creator.setTemplateRelativePath("/pdf/multitickettemplate.html");
for(String base64Image : base64Images) {
try {
String parsedString = StringUtils.substringAfter(base64Image, ",");
byte[] decoded = Base64.getDecoder().decode(parsedString);
Image image = Image.getInstance(decoded);
float scaler = ((document.getPageSize().getWidth() - document.leftMargin()
- document.rightMargin()) / image.getWidth()) * 100;
image.scalePercent(scaler);
image.setPaddingTop(nomargin);
creator.addImage(Image.getInstance(image));
} catch (BadElementException | IOException e) {
LOGGER.error("Error occured:", e);
}
}
return creator.create();
}

Write arabic characters with PDFBOX [duplicate]

This question already has answers here:
Writing Arabic with PDFBOX with correct characters presentation form without being separated
(2 answers)
Closed 5 years ago.
Update 1
I'm trying to write some Arabic characters in a pdf document using pdfbox. As a result I get some strange characters. You can find below the code snippet I used for my test. Notice that the same code was used to print Latin characters without any problem.
public static void main(String[] args) throws Exception {
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDPage.PAGE_SIZE_A4);
document.addPage(page);
PDPageContentStream stream = new PDPageContentStream(document, page,true, true);
//Use of a unicode font
PDFont font = PDTrueTypeFont.loadTTF(document,"C:/arialuni.ttf");
font.setFontEncoding(new WinAnsiEncoding());
stream.setFont(font, 12);
stream.beginText();
stream.moveTextPositionByAmount(40, 600);
stream.drawString("سي ججس ححسيب حسججسيبنم حح ");
stream.endText();
stream.close();
document.save("c:\\resultpdf.pdf");
document.close();
}
Thanks for your help. I tried a Unicode font downloaded from the Microsoft website, but I still get the same result.
Update 2
By using the methods 'drawUnicodeString' and 'loadTTF', which I got from PDFBOX-922,
I was able to write Arabic characters, but they are disconnected and ordered from left to right. Here are the two methods 'drawUnicodeString' and 'loadTTF':
public void drawUnicodeString(String text) throws IOException {
COSString string = new COSString();
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
string.append(c >> 8);
string.append(c & 0xff);
}
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
string.writePDF(buffer);
appendRawCommands(buffer.toByteArray());
appendRawCommands(32);
appendRawCommands(getISOBytes("Tj\n"));
}
public static PDType0Font loadTTF(PDDocument doc, InputStream is)
throws IOException {
/* Load the font which we will convert to Type0 font. */
PDTrueTypeFont pdTtf = PDTrueTypeFont.loadTTF(doc, is);
TrueTypeFont ttf = pdTtf.getTTFFont();
CMAPEncodingEntry unicodeMap = null;
for (CMAPEncodingEntry candidate : ttf.getCMAP().getCmaps()) {
if (candidate.getPlatformId() == CMAPTable.PLATFORM_WINDOWS
&& candidate.getPlatformEncodingId() == CMAPTable.ENCODING_UNICODE) {
unicodeMap = candidate;
break;
}
}
if (unicodeMap == null) {
throw new RuntimeException(
"To use as CIDFont, the TTF must have a Windows platform Unicode encoding");
}
float scaling = 1000f / ttf.getHeader().getUnitsPerEm();
MyPDCIDFontType2Font pdCidFont2 = new MyPDCIDFontType2Font();
pdCidFont2.setBaseFont(pdTtf.getBaseFont());
pdCidFont2.setFontDescriptor((PDFontDescriptorDictionary) pdTtf
.getFontDescriptor());
/* Fixme -- should determine the minimum and maximum charcode in the map */
int[] cid2gid = new int[65536];
List<Float> widths = new ArrayList<Float>();
int[] widthValues = ttf.getHorizontalMetrics().getAdvanceWidth();
for (int i = 0; i < cid2gid.length; i++) {
int glyph = unicodeMap.getGlyphId(i);
cid2gid[i] = glyph;
widths.add((float) i);
widths.add((float) i);
widths.add(widthValues[glyph] * scaling);
}
pdCidFont2.setCidToGid(cid2gid);
pdCidFont2.setWidths(widths);
pdCidFont2.setDefaultWidth(widths.get(0).longValue());
/* Now construct the type0 font that we actually return */
myType0Font pdFont0 = new myType0Font();
pdFont0.setDescendantFont(pdCidFont2);
pdFont0.setDescendantFonts(new COSObject(pdCidFont2.getCOSObject()));
pdFont0.setEncoding(COSName.IDENTITY_H);
pdFont0.setBaseFont(pdTtf.getBaseFont());
// pdfont0.setToUnicode(COSName.IDENTITY_H); XXX how to express identity
// mapping as ToUnicode program? */
return pdFont0;
}
and here are the characters printed :
I don't know why these characters are disconnected

Arabic can be written by applying both PDFBOX-922 and PDFBOX-1287 (the diff files are attached to the issue descriptions).
I hope that the patches will be applied in version 2.0.

I suggest you try adding the ICU4J JARs to your project:
ICU4J
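Before reaching for an extra library, it can help to confirm that a string actually contains right-to-left content. The sketch below uses only java.text.Bidi from the JDK. It detects the need for bidirectional handling but does not shape or reorder anything; joining Arabic letters into their connected forms is outside what the JDK class does, and ICU4J's com.ibm.icu.text.ArabicShaping and com.ibm.icu.text.Bidi are the classes usually used for that. The class name DirectionCheck is hypothetical.

```java
import java.text.Bidi;

public class DirectionCheck
{
    /**
     * Returns true if the text contains right-to-left content and therefore
     * needs bidirectional processing before it is written glyph by glyph.
     */
    public static boolean needsBidi(String text)
    {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }
}
```

A caller could use needsBidi(text) to decide whether a string can go straight to the content stream or has to be shaped and reordered first.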

How to set underline to PdfContentByte -- iText

I'm having trouble setting underline and overline using PdfContentByte in iText. I want to underline all fields where sectionArea == 1 || sectionArea == 3, as in getFontForFormat. So far I can only apply bold style, and I need the text to be underlined and overlined too.
Here is the code:
public void doOutputField(Field field) {
String fieldAsString = field.toString();
BaseFont baseFont = getFontForFormat(field);
float fontSize = 11;
Point bottomLeft = bottomLeftOfField(field, 11, baseFont);
int align;
align = PdfContentByte.ALIGN_LEFT;
//PdfContentByte content
content.beginText();
content.setFontAndSize(baseFont, fontSize);
content.setColorFill(Color.BLACK);
double lineHeight = field.getOutputHeight();
content.showTextAligned(align, fieldAsString, (float) bottomLeft.x,
(float) bottomLeft.y, 0f);
bottomLeft.y -= lineHeight;
content.endText();
}
public BaseFont getFontForFormat(Field field) {
try {
if (field.getSection().getArea().getArea() == 1
|| field.getSection().getArea().getArea() == 3) {
BaseFont bf = BaseFont.createFont(BaseFont.TIMES_BOLD,
BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
return bf;
} else {
BaseFont bf = BaseFont.createFont("Times-Roman",
BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
return bf;
}
} catch (Exception e) {
}
return null;
}
Thanks in advance
Edit (Solved by Bruno Lowagie):
This problem can be solved by utilizing ColumnText.
if (field.getSection().getArea().getArea() == 1
|| field.getSection().getArea().getArea() == 3) {
Chunk chunk = new Chunk(fieldAsString);
chunk.setUnderline(+1f, -2f);
if (field.getSection().getArea().getArea() == 3) {
chunk.setUnderline(+1f, (float) field.getBoundHeight());
}
Font font = new Font();
font.setFamily("Times Roman");
font.setStyle(Font.BOLD);
font.setSize((float) 11);
chunk.setFont(font);
Paragraph p = new Paragraph();
p.add(chunk);
ColumnText ct = new ColumnText(content);
ct.setSimpleColumn(p, (float)bottomLeft.x, (float)bottomLeft.y,
(float)field.getBoundWidth() + (float)bottomLeft.x,
(float)field.getBoundHeight() + (float)bottomLeft.y,
(float)lineHeight, align);
try {
ct.go();
} catch (DocumentException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Thanks

You're making it difficult for yourself by using PdfContentByte.showTextAligned(). Is there any reason why you don't want to use ColumnText?
With PdfContentByte, you have to handle the text state (beginText() and endText()) and the font (setFontAndSize()), and you can only add String values. If you want to add lines (e.g. to underline), you need moveTo(), lineTo(), stroke() operations. These operators need coordinates, so you'll need to measure the size of the line using the BaseFont in combination with the String and the font size. There's some math involved.
If you use ColumnText, you have the option of adding one line at a time using ColumnText.showTextAligned(). Or you can define a column using setSimpleColumn() and let iText take care of distributing the text over different lines. In both cases, you don't have to worry about handling the text state, nor about the font and size. ColumnText accepts Phrase objects, and these objects consist of Chunk objects for which you can define underline and overline values. In this case, iText does all the math for you.
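For the manual PdfContentByte route, the math for a left-aligned underline is roughly the following. This is a sketch with hypothetical names (UnderlineMath, underline). The text width is passed in as a plain number so the snippet runs without iText. In iText it would presumably come from BaseFont.getWidthPoint(text, fontSize), and the returned coordinates would be fed to moveTo(), lineTo() and stroke() after endText().

```java
public class UnderlineMath
{
    /**
     * Computes the end points of an underline for left-aligned text.
     *
     * @param x          x coordinate where the text starts
     * @param baselineY  y coordinate of the text baseline
     * @param textWidth  width of the text in points
     * @param offset     distance of the line below the baseline, in points
     * @return {x1, y1, x2, y2} for moveTo(x1, y1) and lineTo(x2, y2)
     */
    public static float[] underline(float x, float baselineY,
                                    float textWidth, float offset)
    {
        float y = baselineY - offset;
        return new float[] { x, y, x + textWidth, y };
    }
}
```

An overline works the same way with a negative offset, which places the line above the baseline.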
