I am using Apache PDFBox for configuration of PDTextField's on a PDF document where I load Lato onto the document using:
font = PDType0Font.load(
#j_pd_document,
java.io.FileInputStream.new('/path/to/Lato-Regular.ttf')
) # => Lato-Regular
font_name = pd_default_resources.add(font).get_name # => F4
I then pass the font_name into a default_appearance_string for the PDTextField like so:
j_text_field.set_default_appearance("/#{font_name} 0 Tf 0 g") # where font_name is
# passed in from above
The issue now occurs when I proceed to invoke setValue on the PDTextField. Because I set the font_size in the defaultAppearanceString to 0, according to the library's example, the text should scale itself to fit in the text box's given area. However, the behaviour of this 'scale-to-fit' is inconsistent for certain fields: it does not always choose the largest font size to fit in the PDTextField. Might there be any further configuration that might allow for this to happen? Below are the PDFs where I've noticed this problem occurring.
Unfilled, with fonts loaded:
http://www.filedropper.com/0postfontload
Filled, with inconsisteny textbox text sizing:
http://www.filedropper.com/file_327
Side Note: I am using PDFBox through jruby which is just a integration layer that allows Ruby to invoke Java libraries. All java methods for the library available; a java method like thisExampleMethod would have a one-to-one translation into ruby this_example_method.
Updates
In response to comments, the appearances that are incorrect in the second uploaded file example are:
1st page Resident Name field (two text fields that have text that is too small for the given input field size)
2nd page Phone fields (four text fields that have text that overflows the given input field size)
Especially the appearances of the Resident Name fields, the Phone fields, and the Care Providers Address fields appear conspicuous. Only the former two are mentioned by the OP.
Let's inspect these fields; all screen shots are made using Adobe Reader DC on MS Windows:
The Resident Name fields
The filled in Resident Name fields look like this
While the height is appropriate, the glyphs are narrower than they should be. Actually this effect can already be seen in the original PDF:
This horizontal compression is caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:
The widget rectangles: [ 45.72 601.44 118.924 615.24 ] and [ 119.282 601.127 192.486 614.927 ], i.e. 73.204*13.8 in both cases.
The appearance bounding box: [ 0 0 147.24 13.8 ], i.e. 147.24*13.8.
So they have the same height but the appearance bounding box is approximately twice as wide as the widget rectangle. Thus, the text drawn normally in the appearance stream gets compressed to half its width when the appearance is displayed in the widget rectangle.
When setting the value of a field PDFBox unfortunately re-uses the appearance stream as is and only updates details from the default appearance, i.e. font name, font size, and color, and the actual text value, apparently assuming the other properties of the appearance are as they are for a reason. Thus, the PDFBox output also shows this horizontal compression
To make PDFBox create a proper appearance, it is necessary to remove the old appearances before setting the new value.
The Phone fields
The filled in Phone fields look like this
and again there is a similar display in the original file
That only the first two letters are shown even though there is enough space for the whole word, is due to the configuration of these fields: They are configured as comb fields with a maximum length of 2 characters.
To have a value here set with PDFBox displayed completely and not so spaced out, you have to remove the maximum length (or at least have to make it no less than the length of your value) and unset the comb flag.
The Care Providers Address fields
Filled in they look like this:
Originally they look similar:
This vertical compression is again caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:
A widget rectangle: [ 278.6 642.928 458.36 657.96 ], i.e. 179.76*15.032.
The appearance bounding box: [ 0 0 179.76 58.56 ], i.e. 179.76*58.56.
Just like in the case of the Resident Name fields above it is necessary to remove the old appearances before setting the new value to make PDFBox create a proper appearance.
A complication
Actually there is an additional issue when filling in the Care Providers Address fields, after removing the old appearances they look like this:
This is due to a shortcoming of PDFBox: These fields are configured as multi line text fields. While PDFBox for single line text fields properly calculates the font size based on the content and later finely makes sure that the text vertically fits quite well, it proceeds very crudely for multi line fields, it selects a hard coded font size of 12 and does not fine tune the vertical position, see the code of the AppearanceGeneratorHelper methods calculateFontSize(PDFont, PDRectangle) and insertGeneratedAppearance(PDAnnotationWidget, PDAppearanceStream, OutputStream).
As in your form these address fields anyways are only one line high, an obvious solution would be to make these fields single line fields, i.e. clear the Multiline flag.
Example code
Using Java one can implement the solutions explained above like this:
final int FLAG_MULTILINE = 1 << 12;
final int FLAG_COMB = 1 << 24;
PDDocument doc = PDDocument.load(originalStream);
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDType0Font font = PDType0Font.load(doc, fontStream, false);
String font_name = acroForm.getDefaultResources().add(font).getName();
for (PDField field : acroForm.getFieldTree()) {
if (field instanceof PDTextField) {
PDTextField textField = (PDTextField) field;
textField.getCOSObject().removeItem(COSName.MAX_LEN);
textField.getCOSObject().setFlag(COSName.FF, FLAG_COMB | FLAG_MULTILINE, false);;
textField.setDefaultAppearance(String.format("/%s 0 Tf 0 g", font_name));
textField.getWidgets().forEach(w -> w.getAppearance().setNormalAppearance((PDAppearanceEntry)null));
textField.setValue("Test");
}
}
(FillInForm test testFill0DropOldAppearanceNoCombNoMaxNoMultiLine)
Screen shots of the output of the example code
The Resident Name field value now is not vertically compressed anymore:
The Phone and Care Providers Address fields also look appropriate now:
Related
File example: file.
Problem - when extracting text using PdfTextStripper, there is token "9/1/2017" and "387986" after "ASSETS" in the page start which should be removed, and some others hidden tokens.
I have already applied this solution (so I do not copy-paste it here, because actually problem is exactly the same) and still that hidden text is appearing on page. Could it be hidden by something else except clip path?
thanks!
Could it be hidden by something else except clip path?
Yes. In case of your new document the text is written in white on white, e.g. the 387986 after ASSETS is drawn like this:
1 1 1 rg
/TT0 16 Tf
-1011.938 115.993 Td
(#A,BAC)Tj
The initial 1 1 1 rg sets the fill color to RGB WHITE. (Additionally that text is quite tiny but would still be visible if drawn in e.g. BLACK.)
The solution you refer to was implemented for documents like the sample document presented in that issue in which the invisible text is made invisible by defining clip paths (outside the bounds of which the text is) and by filling paths (hiding the text underneath). Thus, your white text won't be recognized by it as hidden.
Unfortunately recognizing invisibility of WHITE on WHITE text is more difficult to determine than that of clipped or covered text because one not only needs to know the a property of the current graphics state (like the clip path) or remove all text inside a given path, one also needs to know the color of the part of the page right before the text is drawn (to check the on WHITE detail).
If, on the other hand, you assume the page background to be essentially WHITE, it is fairly simple to ignore all white text: Simply also detect the current fill color in processTextPosition:
PDColor fillColor = gs.getNonStrokingColor();
and compare it to the flavors of WHITE you want to consider invisible. (Usually it should suffice to compare with RGB, CMYK, and Grayscale WHITE; in seldom cases you'll also have to correctly interpret more complex color spaces. Additionally you might also consider nearly WHITE colors invisible, (.99, .99, .99) RGB can hardly be distinguished from WHITE.)
If you find the current color to be WHITE, ignore the current TextPosition.
Be aware, though, just like the solution you referenced this is not yet the final solution recognizing all WHITE text: For that you'll also have to check the text rendering mode: If it is just filling (the default), the above holds, but if it is (also) stroking, you'll (also) have to consider the stroking color; if it is rendered invisible, there is no color to consider; and if the text rendering mode includes adding to path for clipping, you'll have to wait and determine what will be later drawn in this part of the page as long as the clip path holds, definitely not trivial!
Referring to Build text callout with PDF Clown - Is there a possibility to change the font color of the text within the callout note?
I haven't found a suitable method yet, can someone please give me a hint?
There is no explicit PDF Clown method to set the text color. This might be related to the fact that there is no explicit entry in the PDF annotation dictionary for it either.
There are two options, though:
There is a default appearance (DA) entry for variable text in annotations in general. As PDF Clown does not hide generic object methods, you can extend the original callout sample like this:
// Callout.
composer.showText("Callout note annotation:", new Point(35, 85));
new StaticNote(
page,
new Rectangle(250, 90, 150, 70),
"Text of the Callout note annotation"
).withLine(
new StaticNote.CalloutLine(
page,
new Point(250,125),
new Point(150,125),
new Point(100,100)
)
)
.withLineEndStyle(LineEndStyleEnum.OpenArrow)
.withBorder(new Border(1))
.withColor(DeviceRGBColor.get(Color.YELLOW))
.getBaseDataObject().put(PdfName.DA, new PdfString("1 0 1 rg /Ti 12 Tf"));
You have to use plain PDF instructions there, though, rg sets a RGB color defined by the three preceding values, and Tf sets font and size according to the preceding two values. The result of the above is:
As you see, the text now is purple (red 100%, green 0%, blue 100%). A side effect is, though, that the callout line and the frame around the callout box also are purple.
Alternatively a PDF can bring along an own appearance stream defining the whole appearance of the annotation in question. This means, though, that you really have to draw everything yourself including lines, frames, backgrounds, and text.
PDF Clown allows you to set the appearance of an annotation using the setAppearance and withAppearance methods.
I'm trying to learn PDFBox v2.0 but there seems to be few useful examples and tutorials to get started.
I want to create a simple PDF with text only (no images and fancy stuff) and it looks as the following:
1- Introduction
This is an intro to the document
2- Data
2.1- DataPart1
This is some text here....
2.2- DataPart2
This is another text right here!
3- More Information
Some important informational text here...
I have written the following code:
PDPage firstPage = getNewPage();
PDPageContentStream firstContentStream = new PDPageContentStream(document, firstPage);
firstContentStream.setFont(HEADING_FONT, HEADING_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLUE);
firstContentStream.beginText();
firstContentStream.newLineAtOffset(MARGIN, firstPage.getMediaBox().getHeight() - MARGIN);
firstContentStream.showText("1- Introduction");
firstContentStream.endText();
firstContentStream.setFont(TEXT_FONT, TEXT_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLACK);
firstContentStream.beginText();
firstContentStream.showText("This document lists all the impacts that have been observed during the QA validation of the version v3.1 build");
firstContentStream.endText();
firstContentStream.close();
document.addPage(firstPage);
The text "This is the document intro" appears at the end of the page!
My question: do I have to set the cursor or the offset on every line???!
This seems like hard work. Does this mean I have to compute the string width and compare it to the page width and jump to new lines whenever the text fills the entire line?
I really don't understand.
And also how do I make it generate new pages whenever the current page is full?
Could you please give one simple example? Thanks in advance
Can't I just do something like:
contentStream.addTextOnNewLine("Some text here...");
First of all please be aware that PDFBox has a very low-level text drawing API only, i.e. most methods you call directly correspond to a single instruction in the PDF content stream. In particular PDFBox does not come with a (publicly accessible) layout'ing functionality but you essentially have to do the layout'ing in your code. This is both a freedom (you are not bound by the limits of an existing machinery) and a burden (you have to do everything yourself).
In your case that means:
Does this mean I have to compute the string width and compare it to the page width and jump to new lines whenever the text fills the entire line?
Yes. Confer the answers #Tilman linked for you
How to generate multiple lines in PDF using Apache pdfbox
Creating a new PDF document using PDFBOX API
and search some more on stackoverflow, there are further answer on different variations of that theme, e.g. for justified text blocks here.
But there is a detail in your code which in your case makes things possibly harder than necessary:
My question: do I have to set the cursor or the offset on every line???!
While you do have to re-position for every new line, it suffices to re-position by offset from the start of the previous line if you still are in the same text object, i.e. if you have not called endText() and beginText() in-between. E.g.:
PDPage firstPage = getNewPage();
PDPageContentStream firstContentStream = new PDPageContentStream(document, firstPage);
firstContentStream.setFont(HEADING_FONT, HEADING_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLUE);
firstContentStream.beginText();
firstContentStream.newLineAtOffset(MARGIN, firstPage.getMediaBox().getHeight() - MARGIN);
firstContentStream.showText("1- Introduction");
// removed firstContentStream.endText();
firstContentStream.setFont(TEXT_FONT, TEXT_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLACK);
// removed firstContentStream.beginText();
firstContentStream.newLineAtOffset(0, -TEXT_FONT_SIZE); // reposition, tightly packed text lines
firstContentStream.showText("This document lists all the impacts that have been observed during the QA validation of the version v3.1 build");
firstContentStream.endText();
firstContentStream.close();
document.addPage(firstPage);
newLineAtOffset(0, -TEXT_FONT_SIZE) sets the text insertion position at the same x coordinate as the start of the previous line and TEXT_FONT_SIZE units lower.
(Using TEXT_FONT_SIZE here will result in pretty tightly set text lines; you may want to use a higher value, e.g. 1.4 times the font size.)
Considering your description you might want to use a positive x offset value for the line after a heading, though.
The text "This is the document intro" appears at the end of the page!
This happens because the PDF instruction generated by firstContentStream.beginText() resets the text insertion position to the origin (0, 0) which by default is the lower left corner.
And also how do I make it generate new pages whenever the current page is full?
Well, it does not happen automatically, you explicitly have to do that when you want to switch to a new page. And you do that more or less exactly like you created your first page.
While parsing a already present pdf, I am using
if(op.getOperation().equals( "TJ")) to get text operators, What I want to do is to target only the ones whose color is black(or some other specifiable color). I am unable to find a method for the same in pdfBox docs.
Edit : Basically what I want to do is to keep only black colored text on the pdf, and remove/delete any other text operator which doesnt match the criteria.
Can anyone share a solution ?
Thanks !
Text showing operators
While parsing a already present pdf, I am using if(op.getOperation().equals( "TJ")) to get text operators,
There are more text showing operators you have to take care of in general:
string Tj Show a text string.
string ' Move to the next line and show a text string. This operator shall have the
same effect as the code T* string Tj
aw ac string " Move to the next line and show a text string, using aw as the word spacing and ac as the character spacing (setting the corresponding parameters in the text state). aw and ac shall be numbers expressed in unscaled text space units. This operator shall have the same effect as this code: aw Tw ac Tc string '
array TJ Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount.
(Table 109 in the Pdf specification ISO 32000-1)
Text color
The color used to show text depends on the current text rendering mode.
The text rendering mode, Tmode, determines whether showing text shall cause glyph outlines to be stroked, filled, used as a clipping boundary, or some combination of the three.
(section 9.3.6 in the Pdf specification ISO 32000-1)
It is set using the Tr operator:
render Tr Set the text rendering mode, Tmode, to render, which shall be an integer. Initial value: 0.
(Table 105 in the Pdf specification ISO 32000-1)
Depending on this mode you have to consider the current stroke color, the current fill color, the color of whatever is later-on painted in the defined clipping boundary, or some combination of the three.
The color setting operators are defined in Table 74 of the specification ISO 32000-1.
Most often the glyph outlines merely are filled (mode 0). Thus, most often you have to consider the current fill color. That still leaves quite a lot of color setting commands to consider.
Most often gray, RGB, or CMYK colors are used here. Thus, most often you will have to check
the g, rg, or k operators.
Pure black is set by 0 g, 0 0 0 rg, or 0 0 0 1 k. You might also want to consider values which are very near to those values; they might have been intended as black and only differ due to rounding issues.
Color transformations
To make things a bit more complex: The colors mentioned above may still be transformed to some completely different color, e.g. by means of transfer functions (cf. section 10.4), transparency or blending (cf. section 11).
If you also want to consider these effects, you essentially program your own PDF renderer.
Normally, though, PDFs intended mainly for text on the web don't use these features. Thus, for your purposes I would not consider them at first.
I am using itext and want to make my acrofields curved. Say textfield with rounded corners,
and apply the same to buttons and imageField(pustButtonField).
Is it possible in itext or by using some other api.
Thanks in advance for everybody valuable reply...
Depending on what you're actually asking, there are three possible answers to this question:
ISO-32000 only allows you to define a rectangle as the clickable area of AcroForm fields. This is the area that is highlighted when you select highlight fields. You can define a border for this rectangle consisting of an array containing at least 3 values: the horizontal corner radius, the vertical corner radius and the border width. An optional fourth value allows you to define the dash pattern.
Apart from this you can create any appearance for a widget annotation that corresponds with an AcroForm field. The appearance is stored in the /AP entry of the annotation dictionary. This is quite common for button fields (see for instance the createAppearance() method in the Calculator example). This is not done for text fields, as the appearance will disappear the moment somebody changes the value of the text field.
Maybe you are asking to create a rectangular border that is part of the content stream of the page as opposed to a shape that is defined at the AcroField level (see for instance how Open Office adds border to form fields: these borders don't disappear when you remove the field dictionary).
If I had to guess, I'd say you're looking for answer 2 regarding buttons and for answer 3 regarding text fields.
Update: thank you for accepting the answer even though I misunderstood the question. You were asking for a field where you define a path (not necessarily a straight line) that will be used to position the value of the field (for instance: a word written in a way that the characters form a circle). That's not possible with AcroForm fields.