I need to get information about a checkbox, for instance the check mark style (cross, circle, etc.) and whether it is checked or unchecked. But I can't work out where this information is kept, given that the file contains no XFA information. The file was created with Adobe Acrobat Pro DC 19.21.20049. I tried to find this information using the pdfbox tool, but I didn't find it. This is a screenshot of the checkbox information:
Can anyone explain to me how to get this information and where I should look for it?
For any PdfFormField field with a value you can retrieve that value like this:
PdfObject value = field.getValue();
In case of a checkbox field this will return a PdfName. For an unchecked field, the name is Off. For a checked field it can be anything else even though the specification recommends Yes.
The appearances of these states are more difficult to determine because, at least for the checked state, a checkbox must provide an appearance stream containing the instructions that draw its visualization.
We have analyzed one such stream in response to your other recent question. In that case a ZapfDingbats tick symbol was used and you could apply text extraction to the stream and determine this.
In other cases, e.g. in case of a crossed check box, usually only vector graphics instructions are used.
It also is possible, though, to use a bitmap image here which may show anything.
Thus, while you can of course compare the appearance stream with the standard appearances generated by e.g. Adobe Acrobat and so identify the appearance of many checkbox fields, you won't be able to automatically identify all.
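As a rough illustration of the distinction drawn above, here is a minimal sketch in plain Java that takes the checkbox value name and an already decoded appearance stream and guesses whether the mark is drawn as text (e.g. a ZapfDingbats glyph), as vector graphics, or via an XObject (possibly a bitmap). The class and method names are hypothetical, no PDF library is used, inline images are not handled, and the priority order (text over XObject over vector) is an assumption, so treat the result as a hint only.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckboxAppearanceSketch {

    enum Kind { EMPTY, TEXT, VECTOR, XOBJECT }

    /** The value name "Off" means unchecked; any other name means checked. */
    static boolean isChecked(String valueName) {
        return valueName != null && !"Off".equals(valueName);
    }

    /** Splits a decoded content stream into tokens, dropping literal strings and comments. */
    static List<String> tokens(String content) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int i = 0;
        int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') {                        // literal string: skip to the matching ')'
                flush(cur, out);
                int depth = 1;
                i++;
                while (i < n && depth > 0) {
                    char s = content.charAt(i);
                    if (s == '\\') {
                        i++;                       // skip the escaped character as well
                    } else if (s == '(') {
                        depth++;
                    } else if (s == ')') {
                        depth--;
                    }
                    i++;
                }
            } else if (c == '%') {                 // comment: skip to the end of the line
                flush(cur, out);
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') {
                    i++;
                }
            } else if (Character.isWhitespace(c) || c == '[' || c == ']') {
                flush(cur, out);
                i++;
            } else {
                cur.append(c);
                i++;
            }
        }
        flush(cur, out);
        return out;
    }

    private static void flush(StringBuilder cur, List<String> out) {
        if (cur.length() > 0) {
            out.add(cur.toString());
            cur.setLength(0);
        }
    }

    /** Heuristic (assumption): text wins over XObject, XObject wins over plain path painting. */
    static Kind classify(String content) {
        boolean text = false;
        boolean vector = false;
        boolean xobject = false;
        for (String t : tokens(content)) {
            switch (t) {
                case "Tj": case "TJ": case "'": case "\"":
                    text = true;
                    break;
                case "S": case "s": case "f": case "F": case "f*":
                case "B": case "B*": case "b": case "b*":
                    vector = true;
                    break;
                case "Do":                         // may be an image or a form XObject
                    xobject = true;
                    break;
                default:
                    break;
            }
        }
        if (text) return Kind.TEXT;
        if (xobject) return Kind.XOBJECT;
        if (vector) return Kind.VECTOR;
        return Kind.EMPTY;
    }
}
```

A stream classified as TEXT could then be passed through text extraction to find out which glyph is shown, while VECTOR and XOBJECT streams would need a comparison against known standard appearances.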
My program injects text and pictures into a Word template. This works great via content control data binding (thanks to docx4j and Content-Control-Toolkit).
My problem is that images get resized after injection. What I actually want is the behaviour that Jason described here: http://www.docx4java.org/forums/data-binding-java-f16/picture-content-control-size-t634.html
The current behaviour is to just let it be whatever its natural size is (at a given dpi), unless that is greater than page width, in which case it is scaled down.
According to that post, the behaviour of docx4j has been changed so that pictures are always fitted to the size of the content control, preserving the aspect ratio.
Is it possible to get the "old" behaviour back? Do I have to do that on my own, or is the switch that Jason wrote about already implemented?
As the answer to How to force Docx4j to refresh a replaced image file states, the size of a picture is stored in the main document part. At the moment, I only use XPath to set content in the custom XML part. If there is any possibility to get what I need without touching the document's XML directly, I would really prefer that. A macro to set the size after opening the document in Word is not an option for me.
The first thing to be aware of is that these days we prefer to have a picture in a rich text content control, as opposed to a picture content control.
This is because Word limits your ability to "float" a picture content control.
The handling for this is triggered by w:tag containing 'od:Handler=picture': datastorage/bind.xslt#L165
The basic behaviour is that if the w:sdtContent contains an existing w:drawing/wp:inline/a:graphic then reuse it, so any formatting thus configured is used.
But for a "legacy" picture content control which doesn't contain a:blip (when would this be?), xpathInjectImage is invoked with wp:extent passed in (see bind.xslt#L240).
At line 1143, if (cxl==0 || cyl==0) // Let BPAI work out size
So if you want the image at its natural size, you could try removing the when clause at bind.xslt#L212
By the way, we can also bind escaped XHTML. But there, we make an effort to fit any image not just to the page width, but if in a table cell, to that as well.
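To make the two sizing behaviours from this thread concrete, here is a minimal arithmetic sketch in plain Java. It does not call docx4j; the class and method names are hypothetical, and the "fit" variant reflects my reading of the behaviour described in the question (scale to the content control's extent while keeping the aspect ratio). Extents are expressed in EMU, where one inch is 914400 EMU.

```java
final class ImageExtentSketch {

    static final long EMU_PER_INCH = 914_400L;

    /** Natural size in EMU of one image dimension given in pixels at a given resolution. */
    static long pixelsToEmu(int pixels, int dpi) {
        return Math.round(pixels * (double) EMU_PER_INCH / dpi);
    }

    /** "Old" behaviour: natural size, scaled down only if wider than the available width. */
    static long[] naturalSizeCappedToWidth(int widthPx, int heightPx, int dpi, long maxWidthEmu) {
        long cx = pixelsToEmu(widthPx, dpi);
        long cy = pixelsToEmu(heightPx, dpi);
        if (cx > maxWidthEmu) {
            cy = Math.round(cy * (double) maxWidthEmu / cx);
            cx = maxWidthEmu;
        }
        return new long[] {cx, cy};
    }

    /** "New" behaviour (assumed): fit inside the content control's box, keeping the ratio. */
    static long[] fitIntoBox(int widthPx, int heightPx, long boxCx, long boxCy) {
        double scale = Math.min((double) boxCx / widthPx, (double) boxCy / heightPx);
        return new long[] {Math.round(widthPx * scale), Math.round(heightPx * scale)};
    }

    public static void main(String[] args) {
        long[] natural = naturalSizeCappedToWidth(960, 480, 96, 4_572_000L);
        long[] fitted = fitIntoBox(200, 100, 914_400L, 914_400L);
        System.out.println(natural[0] + " x " + natural[1]);
        System.out.println(fitted[0] + " x " + fitted[1]);
    }
}
```

Passing zero for both extents is what, according to the line quoted above, lets the image code work out the size itself; the first method only approximates that outcome.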
I need to be able to customize the checkbox fill type on demand as I render the pdf.
I must do this only with the AcroFields (pdfStamper.getAcroFields()), because I'm not creating any new fields (just modifying), and that's the only thing I have available to me in scope. I have tried about a hundred things, including the one listed below, which was my best guess on how to do this.
PdfDictionary dictionary = (PdfDictionary) acrofields.getFieldItem("ABCD").getWidget(0).get(PdfName.MK);
dictionary.put(PdfName.CA, new PdfString("8", PdfObject.TEXT_UNICODE));
ABCD is just for testing.
I am very stuck, and any help would be greatly appreciated. I am able to change the value in the dictionary, but it does not seem to have any effect when I write the PDF to a file. Other changes, such as setting the checkbox to checked/unchecked, work, as does populating text fields. So I am very surprised and confused as to why this is not working.
You're trying to change the caption of a check box, but it's unclear to me what you expect to see. Do you want to see the text "8"? In that case, changing the caption isn't sufficient. You also need to change the appearance. You can find the possible appearances under the /AP key. In the case of a check box, you'll find two possible appearance states under the normal appearance (/N). These XObjects define what you see when the PDF is rendered.
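To illustrate why the caption alone changes nothing visible, here is a minimal sketch in plain Java of what a replacement "on" appearance stream for a caption character might contain. Everything here is an assumption for illustration: the class and method names are hypothetical, the font resource name /ZaDb must actually exist in the appearance's resources, the size and offset factors are rough guesses, and the caption-to-style table follows the convention I know from Acrobat-generated check boxes rather than anything stated in this thread. The resulting text would still have to be written into the XObject under /AP /N with whatever library you use.

```java
import java.util.Locale;

final class CheckMarkStreamSketch {

    /** Conventional ZapfDingbats caption characters for check box styles (assumption). */
    static String styleOf(String caption) {
        switch (caption) {
            case "4": return "check";
            case "l": return "circle";
            case "8": return "cross";
            case "u": return "diamond";
            case "n": return "square";
            case "H": return "star";
            default:  return "unknown";
        }
    }

    /** Escapes backslashes and parentheses for use inside a PDF literal string. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    /** Builds content stream text that shows the caption glyph inside a width x height box. */
    static String onAppearance(String caption, double width, double height) {
        double size = Math.min(width, height) * 0.8;   // rough guess at a fitting font size
        double x = width * 0.15;                        // rough guess at the left offset
        double y = height * 0.2;                        // rough guess at the baseline offset
        return String.format(Locale.ROOT,
                "q%nBT%n/ZaDb %.2f Tf%n0 g%n%.2f %.2f Td%n(%s) Tj%nET%nQ%n",
                size, x, y, escape(caption));
    }

    public static void main(String[] args) {
        System.out.println(styleOf("8"));
        System.out.print(onAppearance("8", 10, 10));
    }
}
```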
Is there any way in iText to format a TextField's input? I want to have a TextField accept a phone number "(###)###-####", but I don't want the user to have to format it when they enter it. Pdf supports masks on form fields, is there any way to do this in iText?
My current solution is to create the pdf in Acrobat, then populate known fields through iText. But that isn't ideal for this deployment. Ideally I'll have iText generate the entire form.
Thanks for all assistance in advance.
You can add JavaScript to your form that changes the content of fields. See for instance the Calculator example for a fun PDF that acts as a Calculator (obviously this app only works in a PDF viewer that supports JavaScript).
When you create a text field, you need to add an additional action with the setAdditionalActions() method. You can choose between different events: K for keystroke (e.g. useful if you want to change every character to uppercase when somebody fills out a form), Bl for blurred (useful to process the content of a field as soon as the focus is lost), etc.
You can write your own document-level JavaScript to format the fields. See calculator.js for the JavaScript used in the Calculator example. Or you can use one of the many AF methods that are predefined in Adobe Reader, such as AFNumber_Format (I can't find an overview of the available methods right now).
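The logic such a keystroke (K) and blur (Bl) script would need is small. The sketch below writes it in plain Java purely to illustrate the mask "(###)###-####"; inside the PDF it would have to be JavaScript attached through the additional actions. The class and method names are hypothetical.

```java
final class PhoneMaskSketch {

    /** Keystroke-style check: accept only ASCII digits, and at most ten of them. */
    static boolean acceptKeystroke(String digitsSoFar, char typed) {
        return typed >= '0' && typed <= '9' && digitsSoFar.length() < 10;
    }

    /** Blur-style step: reformat ten digits as (###)###-####, otherwise leave the input alone. */
    static String format(String input) {
        String digits = input.replaceAll("\\D", "");
        if (digits.length() != 10) {
            return input;
        }
        return "(" + digits.substring(0, 3) + ")" + digits.substring(3, 6) + "-" + digits.substring(6);
    }

    public static void main(String[] args) {
        System.out.println(format("5551234567"));
    }
}
```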
I use the Java version of PDF Clown to fill out the fields of PDF Acroforms. This works great and I'm able to programmatically fill out forms and save them without any issues.
However, some PDF viewers render some of the text invisible in the fields I'm filling out, unless you click on them in which case they become visible. This forum post explains that this can happen in form-fillable PDFs in general and that it can be fixed by setting the background color of the PDF field to "None", even if the GUI already says that the background color is "None." This has worked for others and I'd like to try it for myself.
Unfortunately, I'm stuck on how to actually do this in PDF Clown. There isn't a direct method like field.setBackgroundColor(null) for the Field class, and I'm not able to figure out a way to do it by using one of the other accessor methods, like getDefaultAppearanceState().
Is there anyone who knows how to do this in PDF Clown?
EDIT: A sample PDF with this issue can be found here. Everything in this PDF was filled in with PDF Clown. Note in particular that the two fields in the upper left (labeled with "Name") are invisible until clicked on. The five fields in the right are also invisible until clicked on, except for the "Charisma" field, which was previously invisible, but then I manually typed in the value and then it was made visible. Everything else was put in by PDF Clown, but unlike the other fields was made visible.
EDIT 2: It has since been discovered that this only happens when you overwrite values in an existing form-fillable character sheet. An original can be downloaded here.
As a first analysis:
Nearly as suspected in my original comment, the field "Name Line 1" contains the value (field dictionary V) "Doc Lightning" but a normal appearance stream (field dictionary AP -> appearances dictionary, key N) which displays no text.
Furthermore, the interactive form dictionary entry NeedAppearances is not set to true; thus, the PDF viewer is led to believe that the appearance streams are up to date. Only when you click into the field and, therefore, signal that you want to edit, the PDF viewer generates a new appearance of the stream, an appearance of its own making which it understands completely for the task of editing.
If you filled in that form field and no other tool changed your results afterwards, therefore, something is wrong either in your code or in PDF Clown. Please provide some self-contained sample code and not-yet-filled-in document to reproduce the issue.
EDIT:
I just applied the current (trunk) PDF Clown AcroFormFillingSample.java sample to the not-yet-filled-in Character Sheet (i.e. the revision consisting of the initial 1458834 bytes of your file), and the result is ok, all field contents are visible even without clicking into them. Thus there is something special in your source... (or do you use an older version?)
In detail:
Page 1 of the character sheet of Doc Lightning references the annotation in object 162:
/MK <<>>
/F 4
/Type /Annot
/Subtype /Widget
/Rect [37.0108, 617.055, 156.923, 631.717]
/FT /Tx
/DA /Helv 12 Tf 0 g
/T (Name Line 1)
/V (Doc Lightning)
/P 47 0 R
/AP 537 0 R
Thus, the value of the field indeed is "Doc Lightning".
On the other hand, the appearances dictionary in object 537 references the normal appearance stream:
/N 538 0 R
And the stream in object 538 only contains:
/Tx BMC
q
1 0 0 1 2 -7.331 cm
/Helv 12 Tf
Q
EMC
So the normal appearance stream positions itself in the field (setting the current transformation matrix accordingly) and selects a font (Helvetica, properly defined in the resources, BTW), and then prints... nothing!
The interactive form dictionary (object 144) does not contain a NeedAppearances entry at all. According to the PDF specification ISO 32000-1:2008, Table 218, this entry is
A flag specifying whether to construct appearance streams and appearance dictionaries for all widget annotations in the document (see 12.7.3.3, “Variable Text”). Default value: false.
Thus, the PDF viewer acts just like expected when not showing the value "Doc Lightning" of "Name Line 1" but instead the empty appearance stream.
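The diagnosis above can be expressed as a small check. The following plain Java sketch assumes you already have the field value, the decoded normal appearance stream, and the NeedAppearances flag as simple values; the names are hypothetical and no PDF Clown API is involved. The tokenization is deliberately naive (it splits on whitespace and brackets), so a string operand that happens to contain a word like "Tj" would cause a false positive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AppearanceTextCheckSketch {

    /** The PDF operators that actually paint text. */
    private static final Set<String> TEXT_SHOWING =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));

    /** Returns true if the content stream contains any text-showing operator. */
    static boolean showsText(String content) {
        for (String token : content.trim().split("[\\s\\[\\]]+")) {
            // Also catch operators glued to a closing parenthesis, e.g. "(abc)Tj".
            String op = token.substring(token.lastIndexOf(')') + 1);
            if (TEXT_SHOWING.contains(op)) {
                return true;
            }
        }
        return false;
    }

    /** True if a viewer that trusts the appearance stream would display nothing for the value. */
    static boolean valueHiddenUntilEdited(String value, String normalAppearance, boolean needAppearances) {
        return !value.isEmpty() && !needAppearances && !showsText(normalAppearance);
    }

    public static void main(String[] args) {
        String stream = "/Tx BMC\nq\n1 0 0 1 2 -7.331 cm\n/Helv 12 Tf\nQ\nEMC";
        System.out.println(valueHiddenUntilEdited("Doc Lightning", stream, false));
    }
}
```

Fed with the stream of object 538 quoted above, the check reports that the value would stay hidden, which matches the observed behaviour.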
After revisiting this issue, and carefully looking at the source code, I realized that the Sample.java class of PDFClown's samples had an applyDocumentSettings() method that contained three lines of code missing from my source:
//Previously we instantiated "document" from org.pdfclown.files.File.getDocument()
ViewerPreferences view = new ViewerPreferences(document); // Instantiates viewer preferences inside the document context.
document.setViewerPreferences(view); // Assigns the viewer preferences object to the viewer preferences function.
view.setDisplayDocTitle(true);
I'm not sure that the last line is actually necessary, but I went ahead and kept it in for good measure.
The user mkl wrote in his answer that "the PDF viewer generates a new appearance of the stream, an appearance of its own making which it understands completely for the task of editing." It seems that what the lines of code above do is generate an appearance that is understood to be for reading (and maybe editing?).
I have been playing around with PDFBox and its PDFTextStripperByArea class.
I was able to extract information if the text is bold or italic, but I'm unable to get the underline information.
As far as I understand it, in PDF an underline is done by drawing lines. So in theory I should be able to get some sort of information about lines somewhere around the text. Given this information, I could then find out whether the text is underlined or part of a table.
Here is my code so far:
List<TextPosition> textPos = charactersByArticle.get(index);
for (TextPosition t : textPos)
{
    if (t.getFont().getFontDescriptor() != null)
    {
        if (t.getFont().getFontDescriptor().getFontWeight() > BOLD_WEIGHT ||
            t.getFont().getFontDescriptor().isForceBold())
        {
            isBold = true;
        }
        if (t.getFont().getFontDescriptor().isItalic())
        {
            isItalic = true;
        }
    }
}
I have tried playing around with the PDGraphicsState object, which is processed in the processEncodedText method of the PDFStreamEngine class, but I found no information about lines there.
Any suggestions as to where this information could be retrieved from?
Here is what I have found out so far:
PDFBox uses a resource file to bind PDF operators/instructions to certain classes which then process the information.
If we take a look at the PDFTextStripper.properties resource file under:
pdfbox\src\main\resources\org\apache\pdfbox\resources\
we can see that for instance the BT operator is bound to the
org.apache.pdfbox.util.operator.BeginText class and so on.
The PDFTextStripper under
pdfbox\src\main\java\org\apache\pdfbox\util\
takes this into account and processes the PDF with these classes.
BUT all graphical objects are ignored, therefore no information of underline or table structure!
Now if we take a look at the PageDrawer.properties resource file, we can see that this one binds almost all available operators. It is used by the PageDrawer class under
pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\
The "trick" is now to find out which graphical operators are those who represent underline and tables and to use them in combination with PDFTextStripper.
Now this would mean reading the PDF file specification, which is currently way too much work.
If someone knows which operators are responsible for which actions to draw underlines and table lines please let me know.
As you mention -- PDFBox uses resource files, to bind PDF operators/ instructions to visitors which will process the information.
You'd probably best start by copying PDFBox's existing visitor into your own source-folder, and then adding/ extending the implementation from there.
My long-ago PostScript experience recalls 'moveto' and 'lineto' operators. Since PDF is roughly PS-based, you'll be looking for something similar.
http://learnpostscript.wordpress.com/category/lineto/
PDF format is a b*tch -- it's HTML, done wrong. It represents graphical implementation, not semantics. Even reconstructing sentences is difficult -- words or even individual characters are positioned, the 'space' or 'newline' must be algorithmically reconstructed. In short, Adobe are a*holes. And Reader is a non-ergonomic, bug-riddled, insecure, bloated pig.
However, you can accomplish your requirement -- if you are willing to put, say, 12+ hours of work in. As well as detecting by position, underlines will typically be emitted in the PDF immediately after their text.. so you can latch your detection by PDF document-order, not just page position.
Also, try constructing a trivial two-line PDF with underlined text. Then see what you can make of it, parsing it back in! The underline should stick out like dog's bananas, and once you can detect that, you'll be well on the way.
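Once you have collected the text positions and the horizontal strokes (or thin filled rectangles, which some producers use instead of stroked lines), the positional test itself is simple geometry. Here is a minimal plain Java sketch of it; the types, names, and thresholds are all assumptions for illustration, not part of PDFBox, and the coordinates are taken to be in PDF user space with y growing upwards.

```java
import java.util.ArrayList;
import java.util.List;

final class UnderlineDetectionSketch {

    /** A run of text: left edge, baseline, width, and font size, all in user space units. */
    static final class TextRun {
        final String text;
        final double x, baseline, width, fontSize;

        TextRun(String text, double x, double baseline, double width, double fontSize) {
            this.text = text;
            this.x = x;
            this.baseline = baseline;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    /** A horizontal stroke or thin filled rectangle, reduced to its extent and thickness. */
    static final class HLine {
        final double x1, x2, y, thickness;

        HLine(double x1, double x2, double y, double thickness) {
            this.x1 = x1;
            this.x2 = x2;
            this.y = y;
            this.thickness = thickness;
        }
    }

    /** Heuristic: a thin line slightly below the baseline that spans most of the run. */
    static boolean isUnderlined(TextRun run, HLine line) {
        double gap = run.baseline - line.y;                       // positive if the line is below
        boolean closeBelow = gap >= 0 && gap <= 0.25 * run.fontSize;
        boolean thin = line.thickness <= 0.15 * run.fontSize;
        double overlap = Math.min(run.x + run.width, line.x2) - Math.max(run.x, line.x1);
        boolean covers = overlap >= 0.8 * run.width;
        return closeBelow && thin && covers;
    }

    /** Returns the runs that have at least one matching line. */
    static List<TextRun> underlinedRuns(List<TextRun> runs, List<HLine> lines) {
        List<TextRun> result = new ArrayList<>();
        for (TextRun run : runs) {
            for (HLine line : lines) {
                if (isUnderlined(run, line)) {
                    result.add(run);
                    break;
                }
            }
        }
        return result;
    }
}
```

A table rule that happens to sit directly beneath a line of text would also pass this test, so combining it with the document-order hint above (underline emitted right after its text) should reduce false positives.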
PDFBox is not very good for extensibility, it's mainly just a big pile of algorithms. For this reason, just copy the PDFTextStripper source (and maybe have PageDrawer for reference) and prototype from there.
Hope this helps!
You can use iText to generate PDF reports.
With iText you can add lines easily.
Try the following.
document.add(new LineSeparator(0.5f, 50, null, 0, 198));
The code above draws a line in a PDF report; set the dimensions according to your needs.
Hope this helps.
As far as I have understood the pdfbox, there is no option by which you can read underline. Maybe you can try itextpdf for this purpose.
According to the api getfont() returns The font size.
You can use getStyle() method and it will return STYLE_UNDERLINE for underlined font. Thus you can retrieve underline style.