This question already has answers here:
Writing Arabic with PDFBOX with correct characters presentation form without being separated
(2 answers)
Closed 5 years ago.
Update 1
I'm trying to write some Arabic characters in a PDF document using PDFBox, but the result contains strange characters. You can find below the code snippet I used for my test. Note that the same code prints Latin characters without any problem.
public static void main(String[] args) throws Exception {
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDPage.PAGE_SIZE_A4);
document.addPage(page);
PDPageContentStream stream = new PDPageContentStream(document, page,true, true);
// Use a Unicode font
PDFont font = PDTrueTypeFont.loadTTF(document,"C:/arialuni.ttf");
font.setFontEncoding(new WinAnsiEncoding());
stream.setFont(font, 12);
stream.beginText();
stream.moveTextPositionByAmount(40, 600);
stream.drawString("سي ججس ححسيب حسججسيبنم حح ");
stream.endText();
stream.close();
document.save("c:\\resultpdf.pdf");
document.close();
}
Thanks for your help. I tried a Unicode font downloaded from the Microsoft website, but I still get the same result.
Update 2
By using the method 'drawUnicodeString' and the method 'loadTTF' that I got from PDFBOX-922,
I was able to write Arabic characters, but they are disconnected and ordered left-to-right. Here are the two methods 'drawUnicodeString' and 'loadTTF':
public void drawUnicodeString(String text) throws IOException {
COSString string = new COSString();
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
string.append(c >> 8);
string.append(c & 0xff);
}
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
string.writePDF(buffer);
appendRawCommands(buffer.toByteArray());
appendRawCommands(32);
appendRawCommands(getISOBytes("Tj\n"));
}
public static PDType0Font loadTTF(PDDocument doc, InputStream is)
throws IOException {
/* Load the font which we will convert to Type0 font. */
PDTrueTypeFont pdTtf = PDTrueTypeFont.loadTTF(doc, is);
TrueTypeFont ttf = pdTtf.getTTFFont();
CMAPEncodingEntry unicodeMap = null;
for (CMAPEncodingEntry candidate : ttf.getCMAP().getCmaps()) {
if (candidate.getPlatformId() == CMAPTable.PLATFORM_WINDOWS
&& candidate.getPlatformEncodingId() == CMAPTable.ENCODING_UNICODE) {
unicodeMap = candidate;
break;
}
}
if (unicodeMap == null) {
throw new RuntimeException(
"To use as CIDFont, the TTF must have a Windows platform Unicode encoding");
}
float scaling = 1000f / ttf.getHeader().getUnitsPerEm();
MyPDCIDFontType2Font pdCidFont2 = new MyPDCIDFontType2Font();
pdCidFont2.setBaseFont(pdTtf.getBaseFont());
pdCidFont2.setFontDescriptor((PDFontDescriptorDictionary) pdTtf
.getFontDescriptor());
/* Fixme -- should determine the minimum and maximum charcode in the map */
int[] cid2gid = new int[65536];
List<Float> widths = new ArrayList<Float>();
int[] widthValues = ttf.getHorizontalMetrics().getAdvanceWidth();
for (int i = 0; i < cid2gid.length; i++) {
int glyph = unicodeMap.getGlyphId(i);
cid2gid[i] = glyph;
widths.add((float) i);
widths.add((float) i);
widths.add(widthValues[glyph] * scaling);
}
pdCidFont2.setCidToGid(cid2gid);
pdCidFont2.setWidths(widths);
pdCidFont2.setDefaultWidth(widths.get(0).longValue());
/* Now construct the type0 font that we actually return */
myType0Font pdFont0 = new myType0Font();
pdFont0.setDescendantFont(pdCidFont2);
pdFont0.setDescendantFonts(new COSObject(pdCidFont2.getCOSObject()));
pdFont0.setEncoding(COSName.IDENTITY_H);
pdFont0.setBaseFont(pdTtf.getBaseFont());
// pdfont0.setToUnicode(COSName.IDENTITY_H); XXX how to express identity
// mapping as ToUnicode program? */
return pdFont0;
}
And here are the characters printed:
I don't know why these characters are disconnected.
Arabic can be written by applying both PDFBOX-922 and PDFBOX-1287 (the diff files are attached to the issues' descriptions).
I hope that the patches will be applied in version 2.0.
I suggest you try adding the ICU4J jars to your project:
ICU4J
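If it helps, here is a minimal, untested sketch of the kind of preprocessing ICU4J enables (shaping into presentation forms plus bidi reordering) before handing the string to PDFBox; shapeForPdf is a hypothetical helper name, and the shaped output still has to be drawn with a font that actually contains the Arabic presentation-form glyphs:
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;

// Replace the isolated Arabic code points with their joined presentation forms
// and flip the logical (right-to-left) order into the visual (left-to-right)
// order that a naive text-showing operator expects.
public static String shapeForPdf(String logicalText) throws ArabicShapingException {
    ArabicShaping shaper = new ArabicShaping(ArabicShaping.LETTERS_SHAPE);
    String shaped = shaper.shape(logicalText);
    Bidi bidi = new Bidi(shaped, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT);
    return bidi.writeReordered(Bidi.DO_MIRRORING);
}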
Related
My goal is to transfer textual content from a PDF to a new PDF while preserving the formatting of the font (e.g. bold, italic, underlined).
I try to use the TextPosition list from the existing PDF and write a new PDF from it.
For this, I get the font and font size of the current entry from the TextPosition list and set them on a contentStream to write the upcoming text through contentStream.showText().
After 137 successful loops, this error follows:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
TextPosition tpInfo = null;
String pdfFileInText = null;
int charIDindex = 0;
int pageIndex = 0;
try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
if (!pdfDocument.isEncrypted()) {
MyPdfTextStripper myStripper = new MyPdfTextStripper();
var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
createDirectory();
String newFileString = (srcErledigt + "Test.pdf");
File input = new File(newFileString);
input.createNewFile();
PDDocument document = new PDDocument();
// For Pages
for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
List<List<TextPosition>> pageList = pageIterator.next();
PDPage newPage = new PDPage();
document.addPage(newPage);
PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
contentStream.beginText();
pageIndex++;
// For Articles
for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
List<TextPosition> articleList = articleIterator.next();
// For Text
for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
tpCharID = charIDindex;
tpInfo = tpIterator.next();
System.out.println(tpCharID + ". charID: " + tpInfo);
PDFont tpFont = tpInfo.getFont();
float tpFontSize = tpInfo.getFontSize();
pdfFileInText = tpInfo.toString();
contentStream.setFont(tpFont, tpFontSize);
contentStream.newLineAtOffset(50, 700);
contentStream.showText(pdfFileInText);
charIDindex++;
}
}
contentStream.endText();
contentStream.close();
}
} else {
System.out.println("pdf Encrypted");
}
}
}
MyPdfTextStripper:
public class MyPdfTextStripper extends PDFTextStripper {
public MyPdfTextStripper() throws IOException {
super();
setSortByPosition(true);
}
@Override
public List<List<TextPosition>> getCharactersByArticle() {
return super.getCharactersByArticle();
}
// Add Pages to CharactersByArticle List
public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
final int maxPageNr = doc.getNumberOfPages();
List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
setStartPage(pageNr);
setEndPage(pageNr);
getText(doc);
byPageList.add(List.copyOf(getCharactersByArticle()));
}
return byPageList;
}
}
Additional Info:
There are seven fonts in my document, all of which are embedded as subsets.
I need to write the Text given with the corresponding Font given.
All glyphs that should be written already exist in the original document, where I get my TextPositionList from.
All fonts are of subtype Type1 or Type0.
There is no AcroForm defined
Thanks in advance
Edit 30.08.2022:
Fixed the issue by manually replacing this particular Unicode character (U+00AD) with a placeholder in the String before trying to write it.
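For reference, a minimal sketch of that kind of pre-filtering, generalized to drop every code point the target font cannot encode rather than only U+00AD (sanitizeForFont is a hypothetical helper; it only helps for fonts whose encode() is implemented, so it does not cover the PDCIDFontType0 case below):
// Replace every code point the font has no glyph for with a placeholder,
// so contentStream.showText() no longer throws IllegalArgumentException.
private static String sanitizeForFont(String text, PDFont font) throws IOException {
    StringBuilder sb = new StringBuilder(text.length());
    for (int offset = 0; offset < text.length(); ) {
        int codePoint = text.codePointAt(offset);
        String s = new String(Character.toChars(codePoint));
        try {
            font.encode(s);      // throws if the (subset) font has no glyph, e.g. for U+00AD
            sb.append(s);
        } catch (IllegalArgumentException e) {
            sb.append(' ');      // placeholder
        }
        offset += Character.charCount(codePoint);
    }
    return sb.toString();
}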
Now I ran into this open ToDo:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
@Override
public byte[] encode(int unicode)
{
// todo: we can use a known character collection CMap for a CIDFont
// and an Encoding for Type 1-equivalent
throw new UnsupportedOperationException();
}
Does anyone have any suggestions or workarounds for this?
Edit 01.09.2022
I tried to replace occurrences of that font with an alternative font from the source file, but this opens another problem where a COSStream is "randomly" closed, which means the new document can no longer be saved after writing my text with a contentStream.
Using standard fonts like PDType1Font.HELVETICA instead works, though.
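For illustration, a rough sketch of that fallback (showTextWithFallback is a hypothetical helper; it does not address the closed-COSStream problem, it only avoids the fonts whose encode() fails):
// Try the original font first; if it cannot encode the text,
// fall back to a standard font so the content stream can still be written.
private static void showTextWithFallback(PDPageContentStream cs, String text,
        PDFont originalFont, PDFont fallbackFont, float fontSize) throws IOException {
    try {
        originalFont.encode(text);              // may throw for subset or CID fonts
        cs.setFont(originalFont, fontSize);
    } catch (IllegalArgumentException | UnsupportedOperationException e) {
        cs.setFont(fallbackFont, fontSize);     // e.g. PDType1Font.HELVETICA
    }
    cs.showText(text);
}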
I'm trying to generate a PDF report consisting of sentences in multiple languages. For that I'm using Google Noto fonts, but the Google CJK Noto fonts don't support some of the Latin special characters. Because of that, PDFBox fails to generate the report or sometimes shows weird characters.
Does anyone have an appropriate solution? I tried multiple things, but was unable to find a single TTF file that supports all of Unicode. I also tried falling back to different font files, but that would be too much work.
Languages I support: Japanese, German, Spanish, Portuguese, English.
Note: I don't want to use arialuni.ttf file due to licensing issues.
Can anyone suggest anything?
Here is the code that will be in release 2.0.14 in the examples subproject:
/**
* Output a text without knowing which font is the right one. One use case is a worldwide
* address list. Only LTR languages are supported, RTL (e.g. Hebrew, Arabic) are not
* supported so they would appear in the wrong direction.
* Complex scripts (Thai, Arabic, some Indian languages) are also not supported, any output
* will look weird. There is an (unfinished) effort here:
* https://issues.apache.org/jira/browse/PDFBOX-4189
*
* @author Tilman Hausherr
*/
public class EmbeddedMultipleFonts
{
public static void main(String[] args) throws IOException
{
try (PDDocument document = new PDDocument())
{
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
PDFont font1 = PDType1Font.HELVETICA; // always have a simple font as first one
TrueTypeCollection ttc2 = new TrueTypeCollection(new File("c:/windows/fonts/batang.ttc"));
PDType0Font font2 = PDType0Font.load(document, ttc2.getFontByName("Batang"), true); // Korean
TrueTypeCollection ttc3 = new TrueTypeCollection(new File("c:/windows/fonts/mingliu.ttc"));
PDType0Font font3 = PDType0Font.load(document, ttc3.getFontByName("MingLiU"), true); // Chinese
PDType0Font font4 = PDType0Font.load(document, new File("c:/windows/fonts/mangal.ttf")); // Indian
PDType0Font font5 = PDType0Font.load(document, new File("c:/windows/fonts/ArialUni.ttf")); // Fallback
try (PDPageContentStream cs = new PDPageContentStream(document, page))
{
cs.beginText();
List<PDFont> fonts = new ArrayList<>();
fonts.add(font1);
fonts.add(font2);
fonts.add(font3);
fonts.add(font4);
fonts.add(font5);
cs.newLineAtOffset(20, 700);
showTextMultiple(cs, "abc 한국 中国 भारत 日本 abc", fonts, 20);
cs.endText();
}
document.save("example.pdf");
}
}
static void showTextMultiple(PDPageContentStream cs, String text, List<PDFont> fonts, float size)
throws IOException
{
try
{
// first try all at once
fonts.get(0).encode(text);
cs.setFont(fonts.get(0), size);
cs.showText(text);
return;
}
catch (IllegalArgumentException ex)
{
// do nothing
}
// now try separately
int i = 0;
while (i < text.length())
{
boolean found = false;
for (PDFont font : fonts)
{
try
{
String s = text.substring(i, i + 1);
font.encode(s);
// it works! Try more with this font
int j = i + 1;
for (; j < text.length(); ++j)
{
String s2 = text.substring(j, j + 1);
if (isWinAnsiEncoding(s2.codePointAt(0)) && font != fonts.get(0))
{
// Without this segment, the example would have a flaw:
// This code tries to keep the current font, so
// the second "abc" would appear in a different font
// than the first one, which would be weird.
// This segment assumes that the first font has WinAnsiEncoding.
// (all static PDType1Font Times / Helvetica / Courier fonts)
break;
}
try
{
font.encode(s2);
}
catch (IllegalArgumentException ex)
{
// it's over
break;
}
}
s = text.substring(i, j);
cs.setFont(font, size);
cs.showText(s);
i = j;
found = true;
break;
}
catch (IllegalArgumentException ex)
{
// didn't work, will try next font
}
}
if (!found)
{
throw new IllegalArgumentException("Could not show '" + text.substring(i, i + 1) +
"' with the fonts provided");
}
}
}
static boolean isWinAnsiEncoding(int unicode)
{
String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
if (".notdef".equals(name))
{
return false;
}
return WinAnsiEncoding.INSTANCE.contains(name);
}
}
Alternatives to arialuni can be found here:
https://en.wikipedia.org/wiki/Open-source_Unicode_typefaces
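If the Noto family is preferred for licensing reasons, the same fallback approach works with separate Noto files; a rough sketch of the font setup, meant to slot into the example above and reuse its showTextMultiple (the file names and paths are placeholders, not part of the example):
// Hypothetical font locations; Noto Sans covers the Latin-based languages
// (German, Spanish, Portuguese, English), Noto Sans JP covers Japanese.
PDType0Font notoLatin = PDType0Font.load(document, new File("fonts/NotoSans-Regular.ttf"));
PDType0Font notoJapanese = PDType0Font.load(document, new File("fonts/NotoSansJP-Regular.ttf"));
List<PDFont> fonts = Arrays.asList(PDType1Font.HELVETICA, notoLatin, notoJapanese);
showTextMultiple(cs, "Grüße, olá, hola, こんにちは", fonts, 12);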
I am trying to create PDF files from a list of images. Four images should cover a full page with no margin, padding, or anything. My problem is that the added images are separated by a white line, and I can't figure out a way to remove this separation.
public ByteArrayOutputStream createMultiTicketPdf(List<String> base64Images) {
PDFCreator creator = new PDFCreator();
Document document = creator.getDocument();
creator.setForMulti(true);
float nomargin = 0;
creator.addCustomCSS("common", "/pdf/common.css");
document.setMargins(nomargin, nomargin, nomargin, nomargin);
creator.setTemplateRelativePath("/pdf/multitickettemplate.html");
for(String base64Image : base64Images) {
try {
String parsedString = StringUtils.substringAfter(base64Image, ",");
byte[] decoded = Base64.getDecoder().decode(parsedString);
Image image = Image.getInstance(decoded);
float scaler = ((document.getPageSize().getWidth() - document.leftMargin()
- document.rightMargin()) / image.getWidth()) * 100;
image.scalePercent(scaler);
image.setPaddingTop(nomargin);
creator.addImage(Image.getInstance(image));
} catch (BadElementException | IOException e) {
LOGGER.error("Error occured:", e);
}
}
return creator.create();
}
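One way to rule out the HTML template and layout engine as the source of the gap is to place the images at absolute positions, so nothing can insert spacing between them. A rough, untested sketch in plain iText 5 (it bypasses the custom PDFCreator entirely and assumes four images per A4 page, with base64Images and an OutputStream coming from the surrounding code; exception handling omitted):
// Tile four images on one A4 page with no margins: each image is scaled to
// exactly a quarter of the page and placed absolutely in a 2x2 grid.
Document pdf = new Document(PageSize.A4, 0, 0, 0, 0);
PdfWriter writer = PdfWriter.getInstance(pdf, outputStream);
pdf.open();
float cellWidth = PageSize.A4.getWidth() / 2;
float cellHeight = PageSize.A4.getHeight() / 2;
for (int i = 0; i < base64Images.size() && i < 4; i++) {
    byte[] decoded = Base64.getDecoder().decode(StringUtils.substringAfter(base64Images.get(i), ","));
    Image image = Image.getInstance(decoded);
    image.scaleAbsolute(cellWidth, cellHeight);
    image.setAbsolutePosition((i % 2) * cellWidth, i < 2 ? cellHeight : 0);   // top row first
    writer.getDirectContent().addImage(image);
}
pdf.close();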
I'm currently trying to automatically extract important keywords from a PDF file. I am able to get the text information out of the PDF document, but now I need to know which font size and font family these keywords have.
The following code I already have:
Main
public static void main(String[] args) throws IOException {
String src = "SEM_081145.pdf";
PdfReader reader = new PdfReader(src);
SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
// strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
}
out.flush();
out.close();
}
And I have implemented the text extraction strategy SemTextExtractionStrategy, which looks like this:
public class SemTextExtractionStrategy implements TextExtractionStrategy {
private String text;
@Override
public void beginTextBlock() {
}
@Override
public void renderText(TextRenderInfo renderInfo) {
text = renderInfo.getText();
System.out.println(renderInfo.getFont().getFontType());
System.out.print(text);
}
@Override
public void endTextBlock() {
}
@Override
public void renderImage(ImageRenderInfo renderInfo) {
}
@Override
public String getResultantText() {
return text;
}
}
I can get the FontType, but there is no method to get the font size. Is there another way, or how can I get the font size of the current text segment?
Or are there any other libraries that can extract the font size from text segments? I already had a look at PDFBox and PDFTextStream. The shareware PDF library from Aspose would do the job perfectly, but it's very expensive and I need to use an open-source project.
Thanks to Alexis I could convert his C# solution into Java code:
text = renderInfo.getText();
Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1), topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();
I had some trouble using Alexis' and Prine's solution, since it doesn't deal with rotated text correctly. So this is what I do (sorry, in Scala):
val x0 = info.getAscentLine.getEndPoint
val x1 = info.getBaseline.getStartPoint
val x2 = info.getBaseline.getEndPoint
val length1 = (x2.subtract(x1)).cross((x1.subtract(x0))).lengthSquared
val length2 = x2.subtract(x1).lengthSquared
(length1, length2) match {
case (0, 0) => 0
case _ => length1 / length2
}
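For completeness, a direct Java transliteration of the Scala above using iText's Vector (untested; like the Scala it yields the squared height, so apply Math.sqrt if the linear value is needed):
// Squared distance between the ascent-line end point and the baseline,
// measured perpendicular to the baseline, so rotation does not matter.
Vector ascentEnd = renderInfo.getAscentLine().getEndPoint();
Vector baseStart = renderInfo.getBaseline().getStartPoint();
Vector baseEnd = renderInfo.getBaseline().getEndPoint();
float crossSq = baseEnd.subtract(baseStart).cross(baseStart.subtract(ascentEnd)).lengthSquared();
float baseSq = baseEnd.subtract(baseStart).lengthSquared();
float fontSizeSquared = (crossSq == 0 && baseSq == 0) ? 0 : crossSq / baseSq;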
You can adapt the code provided in this answer, in particular this code snippet:
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;
This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.
If you want the exact font size, use the following code in your renderText:
float fontsize = renderInfo.getAscentLine().getStartPoint().get(1)
- renderInfo.getDescentLine().getStartPoint().get(1);
Modify this as indicated in the other answers for rotated text.
I have to make a PDF with a table. So far it works fine, but now I want to add a wrapping feature, so I need to insert a line feed.
contentStream.beginText();
contentStream.moveTextPositionByAmount(x, y);
contentStream.drawString("Some text to insert into a table.");
contentStream.endText();
I want to add a "\n" before "insert". I tried "\u000A" which is the hex value for linefeed, but Eclipse shows me an error.
Is it possible to add linefeed with drawString?
The PDF format allows line breaks, but PDFBox has no built-in feature for line breaks.
To use line breaks in PDF, you have to define the leading you want to use with the TL operator. The T* operator makes a line break. The ' operator writes the given text into the next line. (See the PDF spec, chapter "Text", for more details. It's not that much.)
Here are two code snippets. Both do the same, but the first snippet uses ' and the second snippet uses T*.
private void printMultipleLines(
PDPageContentStream contentStream,
List<String> lines,
float x,
float y) throws IOException {
if (lines.size() == 0) {
return;
}
final int numberOfLines = lines.size();
final float fontHeight = getFontHeight();
contentStream.beginText();
contentStream.appendRawCommands(fontHeight + " TL\n");
contentStream.moveTextPositionByAmount(x, y);
contentStream.drawString(lines.get(0));
for (int i = 1; i < numberOfLines; i++) {
contentStream.appendRawCommands(escapeString(lines.get(i)));
contentStream.appendRawCommands(" \'\n");
}
contentStream.endText();
}
private String escapeString(String text) throws IOException {
try {
COSString string = new COSString(text);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
string.writePDF(buffer);
return new String(buffer.toByteArray(), "ISO-8859-1");
} catch (UnsupportedEncodingException e) {
// every JVM must know ISO-8859-1
throw new RuntimeException(e);
}
}
Use T* for line break:
private void printMultipleLines(
PDPageContentStream contentStream,
List<String> lines,
float x,
float y) throws IOException {
if (lines.size() == 0) {
return;
}
final int numberOfLines = lines.size();
final float fontHeight = getFontHeight();
contentStream.beginText();
contentStream.appendRawCommands(fontHeight + " TL\n");
contentStream.moveTextPositionByAmount( x, y);
for (int i = 0; i < numberOfLines; i++) {
contentStream.drawString(lines.get(i));
if (i < numberOfLines - 1) {
contentStream.appendRawCommands("T*\n");
}
}
contentStream.endText();
}
To get the height of the font you can use this command:
fontHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
You might want to multiply it with some line pitch factor.
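For example, the getFontHeight() referenced above (not shown in the snippets) could look roughly like this, assuming font and fontSize are fields of the surrounding class and 1.2 is an arbitrary line-pitch factor:
private float getFontHeight() {
    // glyph-box height of the font, scaled from 1000 units to the font size,
    // padded a little so consecutive lines do not touch
    float boxHeight = font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000 * fontSize;
    return boxHeight * 1.2f;
}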
The PDF format doesn't know line breaks inside a shown string. You have to split the string and move the text position to the next line, using moveTextPositionByAmount.
This is not a special PDFBox feature; it is due to the PDF format definition, so there is no way to do it with drawString, and there are also no other methods to be called that support line feeds.
Because the PDF model often isn't the best model for the task at hand, it often makes sense to write a wrapper for it that adds support for whatever's "missing" in your case. This is true for both reading and writing.
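As one illustration of such a wrapper on the writing side, here is a rough word-wrapping helper that could produce the lines list consumed by printMultipleLines above (only getStringWidth is PDFBox API; the method itself and its name are an assumption, not part of the library):
// Split text into lines no wider than maxWidth (in user-space units),
// breaking at spaces; the result can be passed to printMultipleLines().
private static List<String> wrapText(String text, PDFont font, float fontSize, float maxWidth)
        throws IOException {
    List<String> lines = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    for (String word : text.split(" ")) {
        String candidate = current.length() == 0 ? word : current + " " + word;
        float width = font.getStringWidth(candidate) / 1000 * fontSize;
        if (width > maxWidth && current.length() > 0) {
            lines.add(current.toString());
            current = new StringBuilder(word);
        } else {
            current = new StringBuilder(candidate);
        }
    }
    if (current.length() > 0) {
        lines.add(current.toString());
    }
    return lines;
}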