PDF Fields Appear Invisible with PDF Clown

I use the Java version of PDF Clown to fill out the fields of PDF AcroForms. This works great and I'm able to programmatically fill out forms and save them without any issues.
However, some PDF viewers render some of the text invisible in the fields I'm filling out, unless you click on them in which case they become visible. This forum post explains that this can happen in form-fillable PDFs in general and that it can be fixed by setting the background color of the PDF field to "None", even if the GUI already says that the background color is "None." This has worked for others and I'd like to try it for myself.
Unfortunately, I'm stuck on how to actually do this in PDF Clown. There isn't a direct method like field.setBackgroundColor(null) on the Field class, and I can't figure out a way to do it through one of the other accessor methods, like getDefaultAppearanceState().
Is there anyone who knows how to do this in PDF Clown?
EDIT: A sample PDF with this issue can be found here. Everything in this PDF was filled in with PDF Clown. Note in particular that the two fields in the upper left (labeled "Name") are invisible until clicked on. The five fields on the right are also invisible until clicked on, except for the "Charisma" field: it was invisible too, but became visible after I manually typed in its value. All the other fields were also filled in by PDF Clown but, unlike these, are visible.
EDIT 2: It has since been discovered that this only happens when you overwrite values in an existing form-fillable character sheet. An original can be downloaded here.

As a first analysis:
Nearly as suspected in my original comment, the field "Name Line 1" contains the value (field dictionary V) "Doc Lightning" but a normal appearance stream (field dictionary AP -> appearances dictionary, key N) which displays no text.
Furthermore the interactive form dictionary entry NeedAppearances is not set to true; thus, the PDF viewer is led to believe that the appearance streams are up-to-date. Only when you click into the field and, therefore, signal that you want to edit, the PDF viewer generates a new appearance of the stream, an appearance of its own making which it understands completely for the task of editing.
If you filled in that form field yourself and no other tool changed the result afterwards, something is wrong either in your code or in PDF Clown. Please provide some self-contained sample code and a not-yet-filled-in document to reproduce the issue.
EDIT:
I just applied the current (trunk) PDF Clown AcroFormFillingSample.java sample to the not-yet-filled-in Character Sheet (i.e. the revision consisting of the initial 1458834 bytes of your file), and the result is ok, all field contents are visible even without clicking into them. Thus there is something special in your source... (or do you use an older version?)
In detail:
Page 1 of the character sheet of Doc Lightning references the annotation in object 162:
/MK <<>>
/F 4
/Type /Annot
/Subtype /Widget
/Rect [37.0108, 617.055, 156.923, 631.717]
/FT /Tx
/DA /Helv 12 Tf 0 g
/T (Name Line 1)
/V (Doc Lightning)
/P 47 0 R
/AP 537 0 R
Thus, the value of the field indeed is "Doc Lightning".
On the other hand, the appearances dictionary in object 537 references the normal appearance stream:
/N 538 0 R
And the stream in object 538 only contains:
/Tx BMC
q
1 0 0 1 2 -7.331 cm
/Helv 12 Tf
Q
EMC
So the normal appearance stream positions itself in the field (setting the current transformation matrix accordingly) and selects a font (Helvetica, properly defined in the resources, BTW), and then prints... nothing!
The interactive form dictionary (object 144) does not contain a NeedAppearances entry at all. According to the PDF specification ISO 32000-1:2008, Table 218, this entry is
A flag specifying whether to construct appearance streams and appearance dictionaries for all widget annotations in the document (see 12.7.3.3, “Variable Text”). Default value: false.
Thus, the PDF viewer behaves exactly as expected when it shows the empty appearance stream instead of the value "Doc Lightning" of "Name Line 1".
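To make the mismatch above concrete, here is a minimal, library-free sketch that checks whether an appearance stream draws any text at all. It is not PDF Clown code: the class AppearanceCheck and its methods are made up for illustration, and the tokenizer is deliberately simplistic (it skips literal strings and names but ignores hex strings, inline images and comments). It only looks for the text-showing operators Tj, TJ, ' and ".

```java
final class AppearanceCheck {

    /** Returns true if the content stream contains a text-showing operator. */
    static boolean showsText(String content) {
        StringBuilder token = new StringBuilder();
        int depth = 0; // > 0 while inside a (...) literal string
        for (int i = 0; i <= content.length(); i++) {
            // A virtual trailing space flushes the last token.
            char c = i < content.length() ? content.charAt(i) : ' ';
            if (depth > 0) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
                continue;
            }
            boolean delimiter = Character.isWhitespace(c) || "()[]<>/".indexOf(c) >= 0;
            if (!delimiter) {
                token.append(c);
                continue;
            }
            if (isTextShowingOperator(token.toString())) {
                return true;
            }
            token.setLength(0);
            if (c == '(') {
                depth = 1;
            } else if (c == '/') {
                token.append(c); // names keep their slash, so /Tj is not an operator
            }
        }
        return false;
    }

    /** A field looks stale if it has a value but its appearance draws no text. */
    static boolean looksStale(String fieldValue, String normalAppearance) {
        return fieldValue != null && !fieldValue.isEmpty() && !showsText(normalAppearance);
    }

    private static boolean isTextShowingOperator(String token) {
        return token.equals("Tj") || token.equals("TJ")
                || token.equals("'") || token.equals("\"");
    }
}
```

Fed the stream of object 538 shown above, showsText returns false, so looksStale("Doc Lightning", ...) flags the field; a stream that contains something like (Doc Lightning) Tj between BT and ET would pass.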

After revisiting this issue and carefully looking at the source code, I realized that the Sample.java class of PDF Clown's samples has an applyDocumentSettings() method containing three lines of code that were missing from my source:
//Previously we instantiated "document" from org.pdfclown.files.File.getDocument()
ViewerPreferences view = new ViewerPreferences(document); // Instantiates viewer preferences inside the document context.
document.setViewerPreferences(view); // Assigns the viewer preferences object to the document.
view.setDisplayDocTitle(true);
I'm not sure that the last line is actually necessary, but I went ahead and kept it in for good measure.
The user mkl wrote in his answer that "the PDF viewer generates a new appearance of the stream, an appearance of its own making which it understands completely for the task of editing." It seems that what the lines of code above do is generate an appearance that is understood to be for reading (and maybe editing?).

Related

How to get information about checkbox using itext7 library?

I need to get information about a checkbox, for instance the kind of check mark (cross, circle, etc.), whether it is checked or unchecked, and so forth. But I can't work out where this information is kept, given that the file contains no XFA information. The file was created with Adobe Acrobat Pro DC 19.21.20049. I tried to find this information using the pdfbox tool, but I didn't find it. Here is a screenshot of the checkbox information:
Can anyone explain to me how to get this information and where to find it?

For any PdfFormField field with a value you can retrieve that value like this:
PdfObject value = field.getValue();
In case of a checkbox field this will return a PdfName. For an unchecked field, the name is Off. For a checked field it can be anything else even though the specification recommends Yes.
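As a small, library-free sketch of that rule: once you have the name from the field value as a string, the checked test is simply "anything but Off". The class CheckboxState below is hypothetical and not part of iText; how you turn the PdfObject returned by getValue() into a string is left out here. Treating a missing value as unchecked is my assumption.

```java
final class CheckboxState {

    /**
     * Interprets the name stored as a checkbox field value.
     * "Off" means unchecked; any other name (commonly "Yes") means checked.
     * A missing value is treated as unchecked (assumption of this sketch).
     */
    static boolean isChecked(String valueName) {
        if (valueName == null) {
            return false;
        }
        // Accept the name with or without its leading slash.
        String name = valueName.startsWith("/") ? valueName.substring(1) : valueName;
        return !name.isEmpty() && !name.equals("Off");
    }
}
```

For example, isChecked("Yes") and isChecked("/On") are both true, while isChecked("Off") is false.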
The appearances of these states are more difficult to determine because, at least for the checked state, a checkbox must provide an appearance stream containing instructions to create a visualization.
We have analyzed one such stream in response to your other recent question. In that case a ZapfDingbats tick symbol was used and you could apply text extraction to the stream and determine this.
In other cases, e.g. in case of a crossed check box, usually only vector graphics instructions are used.
It also is possible, though, to use a bitmap image here which may show anything.
Thus, while you can of course compare the appearance stream with the standard appearances generated by e.g. Adobe Acrobat and so identify the appearance of many checkbox fields, you won't be able to automatically identify all.

How to inject images into a Word template via docx4j without getting them resized

My program injects text and pictures into a Word template. This works great via content control data binding (thanks to docx4j and Content-Control-Toolkit).
My problem is that images get resized after injection. What I actually want is the behavior that Jason described here: http://www.docx4java.org/forums/data-binding-java-f16/picture-content-control-size-t634.html
The current behaviour is to just let it be whatever its natural size is (at a given dpi), unless that is greater than page width, in which case it is scaled down.
According to that post, the behavior of docx4j has been changed so that pictures always fit the size of the content control while preserving the aspect ratio.
Is it possible to get the "old" behavior back? Do I have to do that on my own, or is the switch that Jason wrote about already implemented?
As the answer to How to force Docx4j to refresh a replaced image file states, the size of a picture is stored in the main document part. At the moment, I only use XPath to set content in the custom XML part. If there is any possibility to get what I need without touching the document's XML directly, I would really prefer that. A macro to set the size after opening the document in Word is not an option for me.

The first thing to be aware of is that these days we prefer to have a picture in a rich text content control, as opposed to a picture content control.
This is because Word limits your ability to "float" a picture content control.
The handling for this is triggered by w:tag containing 'od:Handler=picture': datastorage/bind.xslt#L165
The basic behaviour is that if the w:sdtContent contains an existing w:drawing/wp:inline/a:graphic then reuse it, so any formatting thus configured is used.
But for a "legacy" picture content control which doesn't contain a:blip (when would this be?), xpathInjectImage is invoked with wp:extent passed in (see bind.xslt#L240).
At line 1143, if (cxl==0 || cyl==0) // Let BPAI work out size
So if you want the image at its natural size, you could try removing the when clause at bind.xslt#L212
By the way, we can also bind escaped XHTML. But there, we make an effort to fit any image not just to the page width, but if in a table cell, to that as well.
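For reference, the "natural size at a given dpi, scaled down if wider than the page" rule quoted in the question boils down to a little arithmetic. The sketch below is not docx4j's implementation; the class ImageExtent and its method are made up for illustration. It only relies on the fact that drawing extents are expressed in EMU, with 914400 EMU per inch.

```java
final class ImageExtent {

    static final long EMU_PER_INCH = 914_400L;

    /**
     * Returns {cx, cy} in EMU for an image of the given pixel size and dpi.
     * If the natural width exceeds maxWidthEmu, the image is scaled down
     * proportionally so that it just fits.
     */
    static long[] naturalExtentEmu(int widthPx, int heightPx, double dpi, long maxWidthEmu) {
        double cx = widthPx / dpi * EMU_PER_INCH;
        double cy = heightPx / dpi * EMU_PER_INCH;
        if (cx > maxWidthEmu) {
            double scale = maxWidthEmu / cx; // keep the aspect ratio
            cx = maxWidthEmu;
            cy = cy * scale;
        }
        return new long[] { Math.round(cx), Math.round(cy) };
    }
}
```

A 300 x 150 pixel image at 96 dpi keeps its natural 3.125 x 1.5625 inches on a page with 6 inches of usable width, whereas a 960 x 480 pixel image (10 x 5 inches) would be reduced to 6 x 3 inches.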

Find invisible text in iText

I am creating a multi-page PDF document using iText. I am adding some unique text to one of the pages in the middle of this document, but making it invisible as follows:
Chunk chunk = new Chunk("invisible text here");
chunk.setTextRenderMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE, 0f, null);
iTextDoc.add(new Paragraph(Element.ALIGN_JUSTIFIED, chunk)); // iTextDoc is a com.lowagie.text.Document
The reason for adding this invisible text is to identify this particular page at the time of onEndPage(). But it is failing.
To detect it in onEndPage(), I have the following code:
boolean b = (pdfWriter.getDirectContent().toString()).contains("invisible text here");
I get the value of b as false.
If I search for any other (visible) text on that page, b is true.
I tried to manually search the invisible text in the PDF reader and it finds the text.
What could I modify to achieve this?

It is never a good idea to assume you can recognize text in the content without elaborate parsing. The text may be split into multiple segments, the encoding might not be the platform's default character encoding, etc. Thus don't try something like
boolean b = (pdfWriter.getDirectContent().toString()).contains("invisible text here");
You can achieve your goal
The reason for adding this invisible text is to identify this particular page at the time of onEndPage().
much more easily. Simply add a member variable to your PdfPageEvent implementation, i.e. the class with your onEndPage() method, and set it at the point in your code where you used to add the invisible text to the page.
Now you can test that member variable directly in your onEndPage(). Don't forget to reset the variable afterwards, preferably in onEndPage() itself!
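The pattern can be sketched without any iText dependency. Everything below is hypothetical naming: in a real project the class would be your PdfPageEvent implementation (for example a subclass of PdfPageEventHelper), onEndPage() would have iText's signature with the writer and document parameters, and you would do your page-specific work where the page number is recorded here.

```java
import java.util.ArrayList;
import java.util.List;

final class MarkedPageEvent {

    // Set by the document-building code instead of adding invisible text.
    private boolean currentPageIsSpecial;
    private int pageNumber;
    private final List<Integer> specialPages = new ArrayList<>();

    /** Call this where the invisible marker text used to be added. */
    void markCurrentPage() {
        currentPageIsSpecial = true;
    }

    /** Stands in for the page event callback; invoked once per finished page. */
    void onEndPage() {
        pageNumber++;
        if (currentPageIsSpecial) {
            specialPages.add(pageNumber); // do the page-specific work here
            currentPageIsSpecial = false; // reset so the next page starts clean
        }
    }

    List<Integer> specialPages() {
        return specialPages;
    }
}
```

If markCurrentPage() is called while the second page is being built, only page 2 ends up in specialPages(), and the following pages are unaffected because the flag is reset inside onEndPage().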

PDF find out if text is underlined or a table cell

I have been playing around with PDFBox and its PDFTextStripperByArea class.
I was able to extract whether the text is bold or italic, but I'm unable to get the underline information.
As far as I understand it, underlining in PDF is done by drawing lines. So in theory I should be able to get some sort of information about lines somewhere around the text. Given this information I could then find out whether the text is underlined or in a table.
Here is my code so far:
List<TextPosition> textPos = charactersByArticle.get(index);
for (TextPosition t : textPos)
{
    if (t.getFont().getFontDescriptor() != null)
    {
        if (t.getFont().getFontDescriptor().getFontWeight() > BOLD_WEIGHT ||
            t.getFont().getFontDescriptor().isForceBold())
        {
            isBold = true;
        }
        if (t.getFont().getFontDescriptor().isItalic())
        {
            isItalic = true;
        }
    }
}
I have tried playing around with the PDGraphicsState object that is processed in the processEncodedText method of the PDFStreamEngine class, but found no information about lines there.
Any suggestions where this information could be retrieved from?

Here is what I have found out so far:
PDFBox uses a resource file to bind PDF operators/instructions to certain classes which then process the information.
If we take a look at the PDFTextStripper.properties resource file under:
pdfbox\src\main\resources\org\apache\pdfbox\resources\
we can see that, for instance, the BT operator is bound to the org.apache.pdfbox.util.operator.BeginText class and so on.
The PDFTextStripper under
pdfbox\src\main\java\org\apache\pdfbox\util\
takes this into account and uses these classes to process the PDF.
BUT all graphical objects are ignored, therefore there is no information about underlines or table structure!
Now if we take a look at the PageDrawer.properties resource file, we can see that this one binds almost all available operators. It is used by the PageDrawer class under
pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\
The "trick" is now to find out which graphical operators are those who represent underline and tables and to use them in combination with PDFTextStripper.
Now this would mean reading the PDF file specification, which is currently way too much work.
If someone knows which operators are responsible for which actions to draw underlines and table lines please let me know.
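For what it's worth, straight lines in a content stream are usually built from the path operators m (move to), l (line to) and re (rectangle), and then painted with S (stroke) or f (fill); an underline is often either a stroked segment or a very thin filled rectangle. The following library-free sketch pulls such segments out of a content stream fragment. It is not PDFBox code: PathScanner is a made-up name, the 2-unit "thin rectangle" limit is an arbitrary assumption, and the current transformation matrix (cm), strings and curves are ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class PathScanner {

    /**
     * Returns straight segments as {x1, y1, x2, y2}.
     * "x y m ... x y l" yields one segment per l operator; a rectangle
     * "x y w h re" at most 2 units high is reported as its horizontal center line.
     */
    static List<double[]> segments(String content) {
        List<double[]> result = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double curX = 0;
        double curY = 0;
        for (String token : content.trim().split("\\s+")) {
            try {
                operands.add(Double.parseDouble(token));
                continue; // numbers are operands for the next operator
            } catch (NumberFormatException notANumber) {
                // not a number, so treat the token as an operator below
            }
            int n = operands.size();
            if (token.equals("m") && n >= 2) {
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
            } else if (token.equals("l") && n >= 2) {
                double x = operands.get(n - 2);
                double y = operands.get(n - 1);
                result.add(new double[] { curX, curY, x, y });
                curX = x;
                curY = y;
            } else if (token.equals("re") && n >= 4) {
                double x = operands.get(n - 4);
                double y = operands.get(n - 3);
                double w = operands.get(n - 2);
                double h = operands.get(n - 1);
                if (Math.abs(h) <= 2.0) { // thin enough to be a rule (assumption)
                    result.add(new double[] { x, y + h / 2, x + w, y + h / 2 });
                }
            }
            operands.clear(); // every operator consumes its operands
        }
        return result;
    }
}
```

Both "100 498.5 m 200 498.5 l S" and "100 498 100 1 re f" come out as the same horizontal segment from (100, 498.5) to (200, 498.5), which is the kind of candidate you would then compare against the text positions.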

As you mention -- PDFBox uses resource files, to bind PDF operators/ instructions to visitors which will process the information.
You'd probably best start by copying PDFBox's existing visitor into your own source-folder, and then adding/ extending the implementation from there.
My long-ago PostScript experience recalls 'moveto' and 'lineto' operators. Since PDF is roughly PS-based, you'll be looking for something similar.
http://learnpostscript.wordpress.com/category/lineto/
PDF format is a b*tch -- it's HTML, done wrong. It represents graphical implementation, not semantics. Even reconstructing sentences is difficult -- words or even individual characters are positioned, the 'space' or 'newline' must be algorithmically reconstructed. In short, Adobe are a*holes. And Reader is a non-ergonomic, bug-riddled, insecure, bloated pig.
However, you can accomplish your requirement -- if you are willing to put, say, 12+ hours of work in. As well as detecting by position, underlines will typically be emitted in the PDF immediately after their text, so you can latch your detection by PDF document-order, not just page position.
Also, try constructing a trivial two-line PDF with underlined text. Then see what you can make of it, parsing it back in! The underline should stick out like dog's bananas, and once you can detect that, you'll be well on the way.
PDFBox is not very good for extensibility, it's mainly just a big pile of algorithms. For this reason, just copy the PDFTextStripper source (and maybe have PageDrawer for reference) and prototype from there.
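To give an idea of the "detecting by position" part, here is a rough, library-free sketch of the geometric test. All names (UnderlineDetector, TextRun, Segment) are made up, the thresholds are guesses to be tuned against real files, and it assumes PDF user space with y growing upward, so coordinates taken from PDFBox text positions may need converting first. It does not try to tell an underline from a table rule that happens to run directly under the text.

```java
import java.util.List;

final class UnderlineDetector {

    /** Horizontal extent, baseline and font size of a piece of text. */
    static final class TextRun {
        final double left, right, baseline, fontSize;
        TextRun(double left, double right, double baseline, double fontSize) {
            this.left = left;
            this.right = right;
            this.baseline = baseline;
            this.fontSize = fontSize;
        }
    }

    /** A straight line segment from (x1, y1) to (x2, y2). */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /** True if some segment is horizontal, just below the baseline, and spans the text. */
    static boolean isUnderlined(TextRun text, List<Segment> segments) {
        double textWidth = text.right - text.left;
        for (Segment s : segments) {
            boolean horizontal = Math.abs(s.y1 - s.y2) < 0.5;
            double gap = text.baseline - (s.y1 + s.y2) / 2; // positive: line is below
            boolean justBelow = gap >= 0 && gap <= 0.25 * text.fontSize;
            double overlap = Math.min(text.right, Math.max(s.x1, s.x2))
                    - Math.max(text.left, Math.min(s.x1, s.x2));
            boolean spansText = overlap >= 0.8 * textWidth;
            if (horizontal && justBelow && spansText) {
                return true;
            }
        }
        return false;
    }
}
```

For 12 pt text with its baseline at y = 500 between x = 100 and x = 200, a segment from (100, 498.5) to (200, 498.5) counts as an underline, while the same segment at y = 480 or a vertical segment does not.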
Hope this helps!

You can use iText to generate PDF reports.
With iText you can draw lines easily.
Try the following:
document.add(new LineSeparator(0.5f, 50, null, 0, 198));
The code above draws a line in the PDF report; set the dimensions to suit your needs.
Hope this helps.

As far as I understand PDFBox, it has no option for reading underlines. Maybe you can try itextpdf for this purpose.

According to the API, getFont() returns the font size.
You can use getStyle() method and it will return STYLE_UNDERLINE for underlined font. Thus you can retrieve underline style.

How to insert content in the middle of a page in a PDF using iText

I have a requirement to insert content into the middle of a page in a PDF.
The content may be a dynamic table or an image.
My idea was to first split the PDF into two parts, then take the new content and insert it by replacing a placeholder field.
In iText the splitting is called tiling; here is an example of it:
http://itextpdf.com/examples/iia.php?id=116
The code above has two drawbacks:
1. It splits the page into 16 parts. That is just part of the example, but I still can't figure out a way to split the page into only two parts.
2. Each split part is converted into a complete page, which distorts its proportions.
Rearranging the content is another problem.
The remaining content should be re-ordered in append mode, but so far I have only found code that adds complete new pages rather than just the content.
I have found code that adds content to the PDF by replacing a placeholder:
float[] fieldPosition= pdfTemplate.getAcroFields().getFieldPositions("tableField");
PdfPTable table = buildTable();
PdfContentByte cb = stamper.getOverContent(1);
table.writeSelectedRows(0, -1, fieldPosition[1],fieldPosition[4],cb);
Please help me to solve this requirement.

PDF is a presentation format, not an editing format. In other words, it is not designed to allow content insertion with the original content reflowing gracefully. As a consequence, no tool (at least, none that I know of, and surely not iText) will enable you to achieve what you were given as a requirement.
My advice:
refuse the assignment since it's not feasible, or
get your hands on the original document, insert the desired extra content, and then convert to PDF.
