I am facing a problem when invoking the setValue method of a PDField and trying to set a value which contains special characters.
field.setValue("TEST-BY (TEST)")
In detail, if my value contains characters such as U+00A0, I get the following exception:
Caused by: java.lang.IllegalArgumentException: U+00A0 is not
available in this font's encoding: WinAnsiEncoding
A complete stack trace can be found here: Stacktrace
I currently have PDType1Font.TIMES_ROMAN set as the font. To solve this problem I tried other available fonts as well, but the same problem persisted.
I found the following suggestion in this answer https://stackoverflow.com/a/22274334/7434590, but since we use setValue and not one of the showText/drawText methods that can manipulate bytes, I could not use this approach: setValue accepts only a String as a parameter.
Note: I cannot replace the characters with others to solve this issue; I must be able to set any character supported by the font via the setValue method.
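Since a no-break space looks exactly like a normal space in source code and logs, it can help to list which characters of a value fall outside printable ASCII before handing it to setValue. The following is only a JDK-based diagnostic sketch; the class and method names are made up for illustration, and it does not check what a particular PDF font can encode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CodePointReport {

    /** Returns one entry per code point outside printable ASCII (0x20..0x7E). */
    static List<String> nonAscii(String value) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (cp < 0x20 || cp > 0x7E) {
                found.add(String.format(Locale.ROOT, "U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        // The escape below is a no-break space (U+00A0) between "BY" and "(".
        System.out.println(nonAscii("TEST-BY\u00A0(TEST)"));
    }
}
```

Any character reported here is a candidate for the "not available in this font's encoding" exception when a standard Type 1 font is used.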
You'll have to embed a font and not use WinAnsiEncoding:
PDFont formFont = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/somefont.ttf"), false); // check that the font has what you need; ARIALUNI.TTF is good but huge
PDResources res = acroForm.getDefaultResources(); // could be null, if so, then create it with the setter
String fontName = res.add(formFont).getName();
String defaultAppearanceString = "/" + fontName + " 0 Tf 0 g"; // adjust to replace existing font name
textField.setDefaultAppearance(defaultAppearanceString);
Note that this code must be run before calling setValue().
More about this in the CreateSimpleFormWithEmbeddedFont.java example from the source code download.
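The comment "adjust to replace existing font name" can be handled with plain string processing if the field already has a default appearance string whose size and colour should be kept. This is a minimal sketch under that assumption; DefaultAppearance and withFont are hypothetical names, and fontName stands for the value returned by res.add(formFont).getName().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DefaultAppearance {

    // Matches "/FontName <size> Tf", the font-selection part of a DA string.
    private static final Pattern FONT_OP =
            Pattern.compile("/[^\\s/]+(\\s+[0-9.]+\\s+Tf)");

    /** Swaps the font resource name, keeping the size and the remaining operators. */
    static String withFont(String existingDa, String fontName) {
        if (existingDa != null) {
            Matcher m = FONT_OP.matcher(existingDa);
            if (m.find()) {
                return m.replaceFirst(Matcher.quoteReplacement("/" + fontName) + "$1");
            }
        }
        // No usable DA string: fall back to auto size (0) and black text.
        return "/" + fontName + " 0 Tf 0 g";
    }

    public static void main(String[] args) {
        System.out.println(withFont("/Helv 12 Tf 0 g", "F5"));
        System.out.println(withFont(null, "F5"));
    }
}
```

The result would then be passed to textField.setDefaultAppearance(...) as in the snippet above.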
Avoid using WinAnsiEncoding (problems with encoding)
PDDocument document = new PDDocument();
//Fonts
InputStream fontInputStreamAvenirMedium = new URL(Constants.S3 + "/Fonts/Avenir-Medium.ttf").openStream();
InputStream fontInputStreamAvenirBlack = new URL(Constants.S3 + "/Fonts/Avenir-Black.ttf").openStream();
InputStream fontInputStreamDINCondensedBold = new URL(Constants.S3 + "/Fonts/DINCondensedBold.ttf").openStream();
PDFont font = PDType0Font.load(document, fontInputStreamAvenirMedium);
PDFont fontBold = PDType0Font.load(document, fontInputStreamAvenirBlack);
PDFont fontDIN = PDType0Font.load(document, fontInputStreamDINCondensedBold);
//PDFont font = PDTrueTypeFont.load(document, fontInputStreamAvenirMedium, WinAnsiEncoding.INSTANCE); /* encoding problems */
//PDFont fontBold = PDTrueTypeFont.load(document, fontInputStreamAvenirBlack, WinAnsiEncoding.INSTANCE); /* encoding problems */
//PDFont fontDIN = PDTrueTypeFont.load(document, fontInputStreamDINCondensedBold, WinAnsiEncoding.INSTANCE); /* encoding problems */
See also: https://pdfbox.apache.org/2.0/faq.html#fontencoding
I have the following code:
final Footer footer = getSheet().getFooter();
final StringBuilder strFooterText = new StringBuilder();
strFooterText.append(DETAILS_FOOTER.get(0));
strFooterText.append("\n");
// another line
strFooterText.append(DETAILS_FOOTER.get(1));
strFooterText.append(getDetails());
final String fnt = HeaderFooter.font(DEFAULT_FONT_NAME, "regular")
+ HeaderFooter.fontSize(DEFAULT_DETAILS_FOOTER_FONT_HEIGHT);
footer.setLeft(fnt + strFooterText.toString());
This code works fine when I open the resulting XLSX with LibreOffice. When I open it with Excel 2016, the font set via fnt does not work, and with the repair option it gets removed.
Is there a way to change the font size of the footer so that it works with Excel 2016 (and later)?
UPDATE
After the responses I figured out that, next to the left footer, I also have a right footer with data:
footer.setRight("Page " + HeaderFooter.page() + " of " + HeaderFooter.numPages());
This results in CDATA.
From the updates I figured out that the font name is the troublemaker. Without the font name it works. So the following code works:
final Footer footer = getSheet().getFooter();
final StringBuilder strFooterText = new StringBuilder();
strFooterText.append(DETAILS_FOOTER.get(0));
strFooterText.append("\n");
// another line
strFooterText.append(DETAILS_FOOTER.get(1));
strFooterText.append(getDetails());
//final String fnt = HeaderFooter.font(DEFAULT_FONT_NAME, "regular")
// + HeaderFooter.fontSize(DEFAULT_DETAILS_FOOTER_FONT_HEIGHT);
footer.setLeft("&8" + strFooterText.toString());
footer.setRight("Page " + HeaderFooter.page() + " of " + HeaderFooter.numPages());
I don't know whether there is a way to set the font name as well.
The CDATA issue comes from settings in xmlbeans.
Xmlbeans uses XmlOptions while reading and saving XML. Two settings are relevant here: XmlOptions.setSaveCDataLengthThreshold and XmlOptions.setSaveCDataEntityCountThreshold. setSaveCDataLengthThreshold sets the minimum length of text content containing entities beyond which CDATA is used. setSaveCDataEntityCountThreshold sets the number of entities in the text beyond which CDATA is used. The default for setSaveCDataEntityCountThreshold is 5, so text content containing more than 5 entities is wrapped in CDATA.
So the CDATA problem occurs if a header or footer text contains more than 5 & characters, each of which must be written as &amp; in XML. Xmlbeans then uses CDATA blocks for that text.
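To see whether a given header or footer text is likely to cross that threshold, the ampersands can simply be counted. This sketch mirrors only the rule as described here (more than 5 & characters); it ignores the length threshold and is not the xmlbeans implementation. The class and method names are illustrative.

```java
final class CDataHeuristic {

    /** Default entity-count threshold as quoted above for xmlbeans. */
    static final int DEFAULT_ENTITY_THRESHOLD = 5;

    /** Counts the ampersands that would have to be escaped in the XML. */
    static int countAmpersands(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '&') {
                count++;
            }
        }
        return count;
    }

    /** True if the text has more ampersands than the default threshold. */
    static boolean exceedsThreshold(String text) {
        return countAmpersands(text) > DEFAULT_ENTITY_THRESHOLD;
    }

    public static void main(String[] args) {
        String right = "Page &P of &N";
        String center = "&\"Times New Roman,bold\"&24&K00FF00Center footer\n"
                + "&\"Arial,regular\"&8&K000000further Text";
        System.out.println(countAmpersands(right) + " -> " + exceedsThreshold(right));
        System.out.println(countAmpersands(center) + " -> " + exceedsThreshold(center));
    }
}
```

Every formatting code (font, size, colour, page number) contributes one ampersand, so richly formatted footers reach the threshold quickly.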
The only way to avoid this would be to change POIXMLTypeLoader.DEFAULT_XML_OPTIONS and set appropriate values for setSaveCDataLengthThreshold and setSaveCDataEntityCountThreshold there. As Excel itself never uses CDATA blocks, Apache POI should not either.
But you should state more precisely which version of Excel has problems with these CDATA blocks. For me the following complete example produces the wanted results in Excel 2016 as well as in Excel 365 (both on Windows), even though CDATA blocks are used in the XML.
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.hssf.usermodel.HeaderFooter;

public class CreateExcelFooterText {

    static final String DEFAULT_FONT_NAME = "Arial";
    static final short DEFAULT_DETAILS_FOOTER_FONT_HEIGHT = 8;

    public static void main(String[] args) throws Exception {
        StringBuilder strFooterText = new StringBuilder();
        strFooterText.append("The footer text");
        strFooterText.append("\n");
        strFooterText.append("containing ");
        strFooterText.append("multiple lines");
        String fnt = HeaderFooter.font(DEFAULT_FONT_NAME, "regular")
                + HeaderFooter.fontSize(DEFAULT_DETAILS_FOOTER_FONT_HEIGHT);
        Workbook workbook = new XSSFWorkbook(); String filePath = "./CreateExcelFooterText.xlsx";
        //Workbook workbook = new HSSFWorkbook(); String filePath = "./CreateExcelFooterText.xls";
        Sheet sheet = workbook.createSheet();
        sheet.createRow(0).createCell(0).setCellValue("A1");
        Footer footer = sheet.getFooter();
        footer.setLeft(fnt + strFooterText.toString());
        footer.setCenter("&\"Times New Roman,bold\"&24&K00FF00Center footer\n&\"Arial,regular\"&8&K000000further Text");
        footer.setRight("Page " + HeaderFooter.page() + " of " + HeaderFooter.numPages());
        FileOutputStream out = new FileOutputStream(filePath);
        workbook.write(out);
        out.close();
        workbook.close();
    }
}
So if you run this complete example, which exact version of Excel will not properly show the *.xlsx file then?
So the problem was not the CDATA usage but hitting an Excel limit. See Excel specifications and limits:
Characters in a header or footer: 255
So a header and/or footer must not contain more than 255 characters in total.
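A quick pre-check before writing the workbook could look like the sketch below. It assumes that the limit applies to the combined string as stored in the file, that is, the left, center and right parts prefixed with the section codes &L, &C and &R, with all formatting codes counted as characters. That counting rule is an assumption on my side, and the class and method names are only illustrative.

```java
final class FooterLimit {

    /** Limit quoted above from "Excel specifications and limits". */
    static final int EXCEL_LIMIT = 255;

    /** Joins the three parts with the &L, &C and &R section codes. */
    static String combine(String left, String center, String right) {
        StringBuilder sb = new StringBuilder();
        if (!left.isEmpty()) sb.append("&L").append(left);
        if (!center.isEmpty()) sb.append("&C").append(center);
        if (!right.isEmpty()) sb.append("&R").append(right);
        return sb.toString();
    }

    /** True if the combined footer stays within the limit. */
    static boolean fits(String left, String center, String right) {
        return combine(left, center, right).length() <= EXCEL_LIMIT;
    }

    public static void main(String[] args) {
        String left = "&\"Arial,regular\"&8The footer text\ncontaining multiple lines";
        String right = "Page &P of &N";
        System.out.println(combine(left, "", right).length() + " characters, fits: "
                + fits(left, "", right));
    }
}
```

Note that a font code such as &"Arial,regular" alone already takes 16 characters of the budget.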
I do some AcroForm field manipulation for text fields which have parent fields. This works so far, but the form also contains some checkboxes, which should not be changed. But when I store the manipulated PDF to disk and inspect the value of the checkbox, I can see that the value of cb_a.0 has been changed from ÄÖÜ?ß to ?????
My further processing fails because of this unintended change, any idea how to prevent that?
My testcase
@Test
public void changeBoxedFieldsToOne() throws IOException {
    File encodingPdfFile = new File(classLoader.getResource("./prefill/TestFormEncoding.pdf").getFile());
    byte[] encodingPdfByte = Files.readAllBytes(encodingPdfFile.toPath());
    PdfAcrofieldManipulator pdfMani = new PdfAcrofieldManipulator(encodingPdfByte);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() > 0);
    pdfMani.changeBoxedFieldsToOne();
    byte[] changedPdf = pdfMani.savePdf();
    Files.write(Paths.get("./build/changeBoxedFieldsToOne.pdf"), changedPdf);
    pdfMani = new PdfAcrofieldManipulator(changedPdf);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() == 0);
}
public void changeBoxedFieldsToOne() {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    List<PDNonTerminalField> textFieldWithMoreThan2Childrens = getTextFieldsWithMoreThan2Children();
    for (PDField field : textFieldWithMoreThan2Childrens) {
        int amountOfChilds = ((PDNonTerminalField) field).getChildren().size();
        String currentFieldName = field.getPartialName();
        LOG.info("merging fields of fieldnam {0} to one field", currentFieldName);
        PDField firstChild = getChildWithPartialName((PDNonTerminalField) field, "0");
        if (firstChild == null) {
            LOG.debug("found field which has a dot but starts not with 0, skipping this field");
            continue;
        }
        PDField lastChild = getChildWithPartialName((PDNonTerminalField) field, Integer.toString(amountOfChilds - 1));
        PDPage pageWhichContainsField = firstChild.getWidgets().get(0).getPage();
        try {
            removeField(pdDocument, currentFieldName);
        } catch (IOException e) {
            LOG.error("Error while removing field {0}", currentFieldName, e);
        }
        PDField newField = creatNewField(acroForm, field, firstChild, lastChild, pageWhichContainsField);
        acroForm.getFields().add(newField);
        PDAnnotationWidget newFieldWidget = createWidgetForField(newField, pageWhichContainsField, firstChild, lastChild);
        try {
            pageWhichContainsField.getAnnotations().add(newFieldWidget);
        } catch (IOException e) {
            LOG.error("error while adding new field to page");
        }
    }
}
public byte[] savePdf() throws IOException {
    try (final ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        //pdDocument.saveIncremental(out);
        pdDocument.save(out);
        pdDocument.close();
        return out.toByteArray();
    }
}
I am using PDFBox 2.0.8
Here is the source PDF:https://ufile.io/gr01f or here https://www.file-upload.net/download-12928052/TestFormEncoding.pdf.html
Here the output: https://ufile.io/k8cr3 or here https://www.file-upload.net/download-12928049/changeBoxedFieldsToOne.pdf.html
This indeed is a bug in PDFBox: PDFBox cannot properly handle PDF Name objects containing bytes with values outside the US_ASCII range (that is, outside the range 0..127, which is where your umlauts fall).
The first error in PDF Name handling is that PDFBox internally represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. This is wrong; according to the PDF specification:
"a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). [...]
Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a PDF processor. However, occasionally the need arises to treat a name object as text, such as one that represents a font name [...], a colourant name in a Separation or DeviceN colour space, or a structure type [...]
In such situations, the sequence of bytes making up the name object should be interpreted according to UTF-8, a variable-length byte-encoded representation."
Thus, it generally does not make sense to treat a name as anything else than a byte sequence. Only names used in certain contexts should be meaningful as UTF-8 encoded strings.
Furthermore, a mixed UTF-8 / CP-1252 decoding strategy, i.e. one that first tries to decode using UTF-8 and in case of failure tries again with CP-1252, can create the same string representation for different name entities, so it can indeed falsify a document by making unequal names equal.
This is not the problem in your case, though; the names you used can be interpreted.
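The collision risk of such a mixed strategy can be reproduced with the JDK alone. The following is a minimal sketch of the strategy as described here (strict UTF-8 first, CP-1252 on failure), not the PDFBox code itself; the class name is made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MixedNameDecoding {

    /** Strict UTF-8 first; on malformed input fall back to CP-1252. */
    static String decode(byte[] nameBytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(nameBytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(nameBytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        byte[] utf8Name = {(byte) 0xC3, (byte) 0xA4}; // U+00E4 encoded as UTF-8
        byte[] cp1252Name = {(byte) 0xE4};            // U+00E4 encoded as CP-1252
        // Two different byte sequences, i.e. two different PDF names ...
        System.out.println(java.util.Arrays.equals(utf8Name, cp1252Name));
        // ... end up as the same Java string.
        System.out.println(decode(utf8Name).equals(decode(cp1252Name)));
    }
}
```

The first line printed is false and the second is true: the names differ as byte sequences but become indistinguishable once decoded.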
The second error, though, is that while serializing the PDF, PDFBox properly encodes only those characters of the strings representing names which are in US_ASCII; all others are replaced by '?':
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        [...]
    }
}
(from org.apache.pdfbox.cos.COSName.writePDF(OutputStream))
This is where your checkbox values (which internally are represented by PDF Name objects) get damaged beyond repair...
A simpler example to show the problem is this:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
document.save(new File(RESULT_FOLDER, "non-ascii-name.pdf"));
document.close();
In the result the catalog with the custom entry looks like this:
1 0 obj
<<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
/#3F#3F#3F#3F <E4F6FCDF>
>>
In the name key all characters are replaced by '?' in hex encoded form (#3F) while in the string value the characters are appropriately encoded.
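The replacement by '?' is plain JDK behaviour of String.getBytes with US_ASCII and can be shown without PDFBox. In the sketch below the escaping rule is simplified (everything that is not an ASCII letter or digit is hex-escaped), and ISO_8859_1 is used only to show that a single-byte charset would keep the original byte values; neither is meant as the actual PDFBox logic or as a proposed fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class NameSerialization {

    /** Hex-escapes every byte that is not an ASCII letter or digit (simplified rule). */
    static String escape(String name, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(charset)) {
            int v = b & 0xFF;
            boolean plain = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9');
            if (plain) {
                sb.append((char) v);
            } else {
                sb.append(String.format(Locale.ROOT, "#%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut, u-umlaut, sharp s, written as escapes to stay ASCII-only
        String name = "\u00E4\u00F6\u00FC\u00DF";
        System.out.println(escape(name, StandardCharsets.US_ASCII));   // #3F#3F#3F#3F
        System.out.println(escape(name, StandardCharsets.ISO_8859_1)); // #E4#F6#FC#DF
    }
}
```

The first output matches the damaged key in the catalog above; the second matches the bytes of the intact string value.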
After a bit of searching I stumbled over an answer on this topic I gave almost two years ago. Back then the PDF Name object bytes were always interpreted as UTF-8 encoded which led to issues in that question.
As a consequence the issue PDFBOX-3347 was created. To resolve it the mixed UTF-8 / CP-1252 decoding strategy was introduced. As expressed above, though, I'm not a friend of that strategy.
In that Stack Overflow answer I also already discussed the problems related to the use of US_ASCII during PDF serialization, but that aspect has not yet been addressed at all.
Another related issue is PDFBOX-3519, but its resolution also was reduced to trying to fix the parsing of PDF Names, ignoring their serialization.
Yet another related issue is PDFBOX-2836.
I am currently trying to open, edit and save a PDF file using PDFBox.
With plain-text fields it already works, but I'm having a hard time setting RTF (Rich Text Format) text as the value: every time I use setRichTextValue, save, and open the document, the field is empty (unchanged).
The code is as follows (extracted from multiple functions):
PDDocument pdfDoc = PDDocument.load(new File("my pdf path"));
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField("field-to-change");
if (field instanceof PDTextField) {
    PDTextField tfield = (PDTextField) field;
    COSDictionary dict = field.getCOSObject();
    //COSString defaultAppearance = (COSString) dict.getDictionaryObject(COSName.DA);
    //if (defaultAppearance != null && font != "" && size > 0)
    //    dict.setString(COSName.DA, "/" + font + " " + size + " Tf 0 g");
    boolean rtf = true;
    // backslashes have to be escaped in a Java string literal
    String val = "{\\rtf1\\ansi\\deff0 {\\colortbl;\\red0\\green0\\blue0;\\red255\\green0\\blue0;} \\cf2 Red RTF Text \\cf1 }";
    tfield.setRichText(rtf);
    if (rtf)
        tfield.setRichTextValue(val);
    else
        tfield.setValue(val);
}
// save document etc.
By digging through the PDFBox documentation I found this for setRichTextValue(String r):
* Set the fields rich text value.
* Setting the rich text value will not generate the appearance
* for the field.
* You can set {@link PDAcroForm#setNeedAppearances(Boolean)} to
* signal a conforming reader to generate the appearance stream.
* Providing null as the value will remove the default style string.
* @param richTextValue a rich text string
So I added
pdfDoc.getDocumentCatalog().getAcroForm().setNeedAppearances(true);
directly after creating the PDDocument object, and it didn't change anything. So I searched further and found the AppearanceGenerator class, which should create the styles automatically? But it doesn't seem to, and you can't call it manually.
I'm at a loss here and Google is no help either. It seems nobody has ever used this before, or I'm just too stupid. I want the solution to be done in PDFBox since you don't pay for licenses and it already works for everything else I am doing (getting and replacing images, removing text fields), so it must be possible, right?
Thanks in advance.
I am reading PDF documents via the iTextSharp library.
But these documents are in Czech, which uses diacritics (ř ě ž š č etc.).
How can I read these characters? Any ideas? Or is there some way to replace these characters with plain r e z s c?
This is the code in my method. Thanks.
PdfReader reader = new PdfReader("M:/ShareDirs_KSP/RDM_Debtors/DMS_PROD/" + src);
// we can inspect the syntax of the imported page
String text = new String();
for (int page = 1; page <= 1; page++) {
text += PdfTextExtractor.getTextFromPage(reader, page);
}
reader.close();
I have written a small proof of concept that parses the file czech.pdf. This file contains several characters with diacritics. It was created in answer to the following question: Can't get Czech characters while generating a PDF
The text is stored in the file twice: once using a simple font, once using a composite font. In my proof of concept (named ParseCzech), I parse this PDF to a file encoded using UTF-8 (UNICODE):
public void parse(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    FileOutputStream fos = new FileOutputStream(DEST);
    for (int page = 1; page <= 1; page++) {
        fos.write(PdfTextExtractor.getTextFromPage(reader, page).getBytes("UTF-8"));
    }
    fos.flush();
    fos.close();
}
The result is the file czech.txt:
As you can see from the screen shot, the text is extracted correctly (but make sure that the viewer you use knows that the file is encoded as UTF-8, otherwise you may see strange characters instead of the actual text).
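What those "strange characters" look like can be shown with the JDK alone: the same UTF-8 bytes are decoded once as UTF-8 and once as windows-1252, which is a common default in viewers on Windows. The class and method names are only illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8ViewerCheck {

    /** Encodes the text with one charset and decodes the bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // The Czech word for "over": p, r with caron (U+0159), e, s
        String czech = "p\u0159es";
        Charset cp1252 = Charset.forName("windows-1252");
        // Correct viewer setting: the text round-trips unchanged.
        System.out.println(reinterpret(czech, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Wrong viewer setting: the two UTF-8 bytes C5 99 show up as two characters.
        System.out.println(reinterpret(czech, StandardCharsets.UTF_8, cp1252));
    }
}
```

If the extracted file shows such two-character sequences in place of accented letters, the extraction itself is probably fine and only the viewer's encoding setting is wrong.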
Note that some PDFs do not allow text to be extracted correctly. This is explained in the following video: http://www.youtube.com/watch?v=wxGEEv7ibHE
Please share your PDF so that people on Stack Overflow can check whether the extraction fails because of an error in your code or because the PDF doesn't allow the text to be extracted.
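As for the second part of the question (replacing ř ě ž š č with plain r e z s c after extraction), one common approach is Unicode decomposition followed by removal of the combining marks. This is a JDK-only sketch in Java, not iText functionality; .NET offers a comparable string normalization API. The class name is made up.

```java
import java.text.Normalizer;

final class Diacritics {

    /** Decomposes accented letters (NFD) and drops the combining marks. */
    static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // r, e, z, s, c with caron, written as escapes to keep the source ASCII-only
        System.out.println(strip("\u0159 \u011B \u017E \u0161 \u010D"));
        System.out.println(strip("Sk\u00E1kal pes p\u0159es oves"));
    }
}
```

Letters that have no decomposition (for example the Polish ł) are left untouched by this approach.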
Possible Duplicate:
Using PDFBox to write UTF-8 encoded strings to a PDF
I need to create a PDF with Czech national characters, and I'm trying to do it with the PDFBox library.
I have copied the following code from some tutorials:
public void doIt(String file, String message) throws IOException, COSVisitorException
{
    PDDocument doc = null;
    try
    {
        doc = new PDDocument();
        PDSimpleFont font = PDType1Font.TIMES_ROMAN;
        TextToPDF textToPdf = new TextToPDF();
        textToPdf.setFont(font);
        textToPdf.setFontSize(12);
        doc = textToPdf.createPDFFromText(new StringReader(message));
        doc.save(file);
    }
    finally
    {
        if( doc != null )
        {
            doc.close();
        }
    }
}
Now I am calling the function doIt:
app.doIt("test.pdf", "Skákal pes přes oves, přes zelenou louku.");
This runs without errors, but in the output PDF I get: "þÿSkákal pes pYes oves, pYes zelenou louku."
I tried to find out how to set UTF-8 encoding in PDFBox, but IMHO there is just no solution for this on the internet.
Do you have any ideas how to get the right text in the output PDF?
Thank you.
I think it's the PDType1Font.TIMES_ROMAN font, which does not support your Czech national characters. If you can get a .ttf file that contains the Czech national characters, load it as a PDFont as shown below and use that instead:
PDFont font = PDTrueTypeFont.loadTTF( doc, new File( "CheckRepFont.ttf" ) );
Here CheckRepFont.ttf is just an example font file name. Replace it with the actual one.
EDIT:
PDStream pdStream = new PDStream(doc);
PDSimpleFont font = PDType1Font.TIMES_ROMAN;
font.setToUnicode(pdStream);
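The garbled output quoted in the question is consistent with the string having been written as UTF-16 with a byte order mark and then shown byte by byte through a single-byte font encoding. That reading is my assumption and is not confirmed anywhere in this thread, but the effect can be reproduced with the JDK alone; the class name is made up.

```java
import java.nio.charset.StandardCharsets;

final class Utf16AsSingleBytes {

    /** Encodes as UTF-16 (big-endian with BOM) and shows each byte as one Latin-1 character. */
    static String view(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf16) {
            int v = b & 0xFF;
            // Control bytes such as 0x00 and 0x01 have no visible glyph and are dropped.
            if (v >= 0x20) {
                sb.append((char) v);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Skakal pes pres oves" with a-acute (U+00E1) and r-caron (U+0159)
        System.out.println(view("Sk\u00E1kal pes p\u0159es oves"));
    }
}
```

The byte order mark FE FF turns into "þÿ", and U+0159 (bytes 01 59) loses its high byte and appears as "Y" (0x59), while U+00E1 (bytes 00 E1) survives. That would explain why only the characters outside Latin-1 are damaged and why a font covering them is needed.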