File example: file.
Problem: when extracting text using PDFTextStripper, the tokens "9/1/2017" and "387986" appear after "ASSETS" at the start of the page and should be removed, along with some other hidden tokens.
I have already applied this solution (I am not copying it here because the problem is exactly the same), but the hidden text still appears on the page. Could it be hidden by something else except clip path?
thanks!
Could it be hidden by something else except clip path?
Yes. In the case of your new document the text is written in white on white; e.g. the 387986 after ASSETS is drawn like this:
1 1 1 rg
/TT0 16 Tf
-1011.938 115.993 Td
(#A,BAC)Tj
The initial 1 1 1 rg sets the fill color to RGB WHITE. (Additionally that text is quite tiny but would still be visible if drawn in e.g. BLACK.)
The solution you refer to was implemented for documents like the sample document presented in that issue in which the invisible text is made invisible by defining clip paths (outside the bounds of which the text is) and by filling paths (hiding the text underneath). Thus, your white text won't be recognized by it as hidden.
Unfortunately, recognizing WHITE on WHITE text as invisible is more difficult than recognizing clipped or covered text: one not only needs to know a property of the current graphics state (like the clip path) or remove all text inside a given path, one also needs to know the color of that part of the page right before the text is drawn (to check the "on WHITE" detail).
If, on the other hand, you assume the page background to be essentially WHITE, it is fairly simple to ignore all white text: Simply also detect the current fill color in processTextPosition:
PDColor fillColor = gs.getNonStrokingColor();
and compare it to the flavors of WHITE you want to consider invisible. (Usually it should suffice to compare with RGB, CMYK, and Grayscale WHITE; in rare cases you'll also have to correctly interpret more complex color spaces. Additionally, you might also consider nearly WHITE colors invisible; (.99, .99, .99) RGB can hardly be distinguished from WHITE.)
If you find the current color to be WHITE, ignore the current TextPosition.
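As a minimal sketch of that comparison, assuming the page background really is white: the helper below is plain Java and hypothetical (the class name WhiteTextCheck, the method isNearlyWhite, and the tolerance parameter are not part of PDFBox). It expects the color space name and the component values, which in PDFBox 2.x should be obtainable via fillColor.getColorSpace().getName() and fillColor.getComponents(). Anything other than the three device color spaces is reported as "not white", so more complex color spaces still need real interpretation.

```java
// Hypothetical helper, not a PDFBox class: decides whether a fill color
// given as (color space name, components) is WHITE or nearly WHITE.
final class WhiteTextCheck {

    static boolean isNearlyWhite(String colorSpaceName, float[] c, float tolerance) {
        if (colorSpaceName == null || c == null) {
            return false;
        }
        switch (colorSpaceName) {
            case "DeviceGray":
                // 1 = white, 0 = black
                return c.length == 1 && c[0] >= 1f - tolerance;
            case "DeviceRGB":
                // (1, 1, 1) = white
                return c.length == 3
                        && c[0] >= 1f - tolerance
                        && c[1] >= 1f - tolerance
                        && c[2] >= 1f - tolerance;
            case "DeviceCMYK":
                // no ink at all = white
                return c.length == 4
                        && c[0] <= tolerance && c[1] <= tolerance
                        && c[2] <= tolerance && c[3] <= tolerance;
            default:
                // Indexed, ICCBased, Separation, ... are not interpreted here.
                return false;
        }
    }
}
```

Inside processTextPosition one would then skip the current TextPosition whenever isNearlyWhite returns true for the current fill color.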
Be aware, though, that just like the solution you referenced, this is not yet a final solution recognizing all WHITE text. You also have to check the text rendering mode: if it is just filling (the default), the above holds; if it is (also) stroking, you (also) have to consider the stroking color; if the text is rendered invisible, there is no color to consider; and if the text rendering mode includes adding to the path for clipping, you have to wait and determine what is later drawn in this part of the page as long as the clip path holds, which is definitely not trivial!
Related
The first page of this PDF displays the following white decorated text on top of an image.
When using the PDFBox utility PrintImageLocations, this graphic is not extracted as an image; only the background image is extracted, without the white decorated text. When converting to a Word document, the decorated text is extracted as a shape with properties that can be modified, such as fill color, border color, and much more.
Is it possible to extract that shape from the PDF, using PDFBox? How?
The simplest way to extract such graphics is to convert those that can be converted into Scalable Vector Graphics (SVG), as here. I had to change the colour from white to magenta; otherwise it would look like a snowscape.
I don't use PDFBox, so I can't say how easily that can be done there. I simply exported page 1 as SVG using
MuPDF\mutool.exe convert -o page1.svg -O no-reuse-images Xcel_Energy-AR2018.pdf 1
However, you will get everything as SVG output, such as the lower text; note the extra header text in the top left corner and the page number in the lower left corner, which were not visible behind the pixel graphics.
Note that everything is converted to SVG objects (conventional text and image pixels alike); there is no easier way to extract all the PostScript printer-style moves and linetos. So yes, it is overkill, as the output needs parsing to get just the object of interest (more easily done in a GUI such as Inkscape or InDesign where it was constructed). It is not a good methodology for shape recognition, since the y x values are described as rectangles and will have positions and scalars that most likely vary from page to page; thus there are no constants other than the filled appearance. The filled object would best be "seen" by regenerating it as pixels for visual symbol recognition (much like OCR).
I am developing a simple devotional app, which has a Kannada (a language in India) sentence to be displayed. I am successful in using typeface and displaying the content.
In a few places I have a word with a line above or below it, as shown below. I tried a spannable image, but I am still not able to achieve this properly.
This is a sample of the code which I am referring to. Here I am using a small icon to display it in between the string.
Spannable span1 = new SpannableString("The imageplace");
Drawable android = TestImageActivity.this.getResources().getDrawable(R.drawable.end);
android.setBounds(5, 0, 20, 5);
ImageSpan image = new ImageSpan(android, ImageSpan.ALIGN_BASELINE);
span1.setSpan(image, 3, 4, Spannable.SPAN_INCLUSIVE_EXCLUSIVE);
tvTextImage3.setText(span1);
ImageSpan extends ReplacementSpan so any characters you are spanning won't get rendered, as the TextLayout is expecting that the span itself will be doing all the rendering.
What I would recommend is implementing your own ReplacementSpan subclass. Since it looks like your graphics are associated with one character, you would wrap the single character.
In the getSize override, you would use start and end to index into text and get the character(s) you are spanning, then use paint.getTextBounds() to measure the width of the text and return that value. You want the width calculation to work in a way that the width of the span doesn't affect the default spacing of the text.
Another thing this method might need to do is change the FontMetrics by increasing the ascent and descent in order to give you some space to draw the lines.
In the draw override, you use the paint to render the text that isn't being rendered within the span. The paint and font metrics should already have the proper values so that your text render looks like the surrounding text. Of course, you'll also render the line graphics you want.
For some sample code, take a look at my answer to a similar question. This has all the pieces I just discussed.
If you want me to write some code for this, you'll need to provide some code that gives me a starting point with some actual Kannada text along with what the lines are and where they go. I don't even know if Kannada text is LTR or RTL; that might affect how the span subclass is coded. Preferably the text would correspond to the image you posted so I can see how it should look when it's working.
I found out there is a new component in LibGDX in nightly builds - TextArea which is part of the scene2d.ui package. It's nice to have a component like this, very easy to use, but what I'm missing is some support for a multi-colored text.
I want to highlight some keywords in a text with a different color but I don't know how to do it with current api. There is one method in BitmapFontCache class:
public void setColors (Color tint, int start, int end)
The Javadoc for this method says the following:
Sets the color of the specified characters. This may only be called after setText(CharSequence, float, float) and is reset every time setText is called.
But I don't know how to use it through a TextArea object, or whether it's even possible to do it that way. Has anyone tried to figure this out? Any hint will be appreciated.
Libgdx offers color markup, which must first be enabled on the BitmapFont with
font.getData().markupEnabled = true;
Text rendered with that font will look for color markup, where colors are surrounded in brackets. Each used color is pushed onto a stack.
Named colors (case sensitive): [RED]red [ORANGE]orange
Hex colors with optional alpha: [#FF0000]red [#FF000033]transparent
A set of empty brackets pops a color off the stack: [BLUE]Blue text[RED]Red text[]Blue text
A double bracket [[ represents an escaped bracket character; however, it will not work as expected when followed by a closing bracket.
Named colors are defined in the class com.badlogic.gdx.graphics.Colors, and can be added with Colors.put("NAME", color);.
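To illustrate the markup syntax above, here is a small sketch in plain Java that wraps every occurrence of a keyword in a color tag and escapes literal opening brackets. The class ColorMarkup and its methods are hypothetical helpers, not part of libGDX, and the sketch inherits the caveat that an escaped [[ may misbehave when a closing bracket follows it directly.

```java
// Hypothetical helper, not a libGDX class: builds color-markup strings.
final class ColorMarkup {

    // Escape literal opening brackets so they are not parsed as color tags.
    static String escape(String raw) {
        return raw.replace("[", "[[");
    }

    // Wrap each occurrence of keyword in [COLOR]...[] and escape the rest.
    static String highlight(String text, String keyword, String colorName) {
        if (keyword.isEmpty()) {
            return escape(text);
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(keyword, from)) >= 0) {
            out.append(escape(text.substring(from, hit)));
            out.append('[').append(colorName).append(']')  // push the color
               .append(escape(keyword))
               .append("[]");                              // pop it again
            from = hit + keyword.length();
        }
        out.append(escape(text.substring(from)));
        return out.toString();
    }
}
```

For example, ColorMarkup.highlight("if x then y", "then", "RED") yields "if x [RED]then[] y", which a font with markupEnabled set should render with "then" in red.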
Hopefully this isn't super late.
I haven't tried it your way, but I bet you would have to override the setText method and then set the colors for the specific points you want. start and end are indices for the pieces of text you want in that particular color.
I have implemented a MulticolorTextArea here: https://github.com/AnEmortalKid/MulticolorTextArea/tree/mta-release
Hopefully this helps out.
While parsing an already present PDF, I am using
if(op.getOperation().equals( "TJ")) to get text operators. What I want to do is to target only the ones whose color is black (or some other specifiable color). I am unable to find a method for this in the PDFBox docs.
Edit: Basically, what I want to do is to keep only black colored text in the PDF and remove/delete any other text operator which doesn't match the criteria.
Can anyone share a solution?
Thanks !
Text showing operators
While parsing an already present PDF, I am using if(op.getOperation().equals( "TJ")) to get text operators,
There are more text showing operators you have to take care of in general:
string Tj Show a text string.
string ' Move to the next line and show a text string. This operator shall have the same effect as the code T* string Tj
aw ac string " Move to the next line and show a text string, using aw as the word spacing and ac as the character spacing (setting the corresponding parameters in the text state). aw and ac shall be numbers expressed in unscaled text space units. This operator shall have the same effect as this code: aw Tw ac Tc string '
array TJ Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount.
(Table 109 in the PDF specification ISO 32000-1)
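A minimal sketch of a check covering all four operators from that table, instead of comparing with "TJ" alone. The class TextShowingOperators is a hypothetical helper in plain Java; the operator name would come from whatever your parsing code returns (op.getOperation() in the question).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: recognizes the four text-showing operators.
final class TextShowingOperators {

    private static final Set<String> NAMES = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("Tj", "'", "\"", "TJ")));

    // Operator names are case sensitive: "Tj" and "TJ" are different operators.
    static boolean isTextShowing(String operatorName) {
        return NAMES.contains(operatorName);
    }
}
```

The condition in the question would then read if (TextShowingOperators.isTextShowing(op.getOperation())).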
Text color
The color used to show text depends on the current text rendering mode.
The text rendering mode, Tmode, determines whether showing text shall cause glyph outlines to be stroked, filled, used as a clipping boundary, or some combination of the three.
(section 9.3.6 in the PDF specification ISO 32000-1)
It is set using the Tr operator:
render Tr Set the text rendering mode, Tmode, to render, which shall be an integer. Initial value: 0.
(Table 105 in the PDF specification ISO 32000-1)
Depending on this mode you have to consider the current stroke color, the current fill color, the color of whatever is later-on painted in the defined clipping boundary, or some combination of the three.
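The dependency on the mode can be sketched as three small predicates over the integer operand of Tr. The mode numbers follow the text rendering modes of ISO 32000-1 (0 fill, 1 stroke, 2 fill then stroke, 3 invisible, 4 to 6 the same three plus clipping, 7 clipping only); the class and method names are hypothetical.

```java
// Hypothetical helper: which colors matter for a given text rendering mode.
final class TextRenderingMode {

    private static void check(int mode) {
        if (mode < 0 || mode > 7) {
            throw new IllegalArgumentException("unknown text rendering mode: " + mode);
        }
    }

    // True if the glyphs are filled, i.e. the current fill color matters.
    static boolean fills(int mode) {
        check(mode);
        return mode == 0 || mode == 2 || mode == 4 || mode == 6;
    }

    // True if the glyph outlines are stroked, i.e. the stroke color matters.
    static boolean strokes(int mode) {
        check(mode);
        return mode == 1 || mode == 2 || mode == 5 || mode == 6;
    }

    // True if the glyphs are added to the clipping path; what is painted
    // later inside that boundary then determines the visible color.
    static boolean clips(int mode) {
        check(mode);
        return mode >= 4;
    }
}
```

Mode 3 answers false to all three predicates: the text is invisible and there is no color to consider.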
The color setting operators are defined in Table 74 of the specification ISO 32000-1.
Most often the glyph outlines merely are filled (mode 0). Thus, most often you have to consider the current fill color. That still leaves quite a lot of color setting commands to consider.
Most often gray, RGB, or CMYK colors are used here. Thus, most often you will have to check the g, rg, or k operators.
Pure black is set by 0 g, 0 0 0 rg, or 0 0 0 1 k. You might also want to consider values which are very near to those values; they might have been intended as black and only differ due to rounding issues.
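A minimal sketch of such a test with a tolerance, in plain Java. It takes the operator name and its numeric operands as they appear in the content stream; the class BlackFillCheck, the method name, and the tolerance parameter are hypothetical, and only the three encodings of black named above are recognized.

```java
// Hypothetical helper: is this fill color operator setting (nearly) black?
final class BlackFillCheck {

    static boolean isNearlyBlack(String operator, float[] operands, float tolerance) {
        if (operator == null || operands == null) {
            return false;
        }
        switch (operator) {
            case "g":   // gray: 0 = black
                return operands.length == 1 && operands[0] <= tolerance;
            case "rg":  // RGB: 0 0 0 = black
                return operands.length == 3
                        && operands[0] <= tolerance
                        && operands[1] <= tolerance
                        && operands[2] <= tolerance;
            case "k":   // CMYK: 0 0 0 1 = black
                return operands.length == 4
                        && operands[0] <= tolerance
                        && operands[1] <= tolerance
                        && operands[2] <= tolerance
                        && operands[3] >= 1f - tolerance;
            default:    // stroke operators (G, RG, K) and sc/scn are ignored here
                return false;
        }
    }
}
```

With a tolerance of 0.02, an operand list such as 0.01 0.01 0.01 rg is still treated as black, which covers small rounding differences.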
Color transformations
To make things a bit more complex: The colors mentioned above may still be transformed to some completely different color, e.g. by means of transfer functions (cf. section 10.4), transparency or blending (cf. section 11).
If you also want to consider these effects, you essentially program your own PDF renderer.
Normally, though, PDFs intended mainly for text on the web don't use these features. Thus, for your purposes I would not consider them at first.
I am using iText and want to make my AcroForm fields curved, say a text field with rounded corners,
and apply the same to buttons and image fields (pushButtonField).
Is this possible in iText or with some other API?
Thanks in advance for everybody's valuable replies.
Depending on what you're actually asking, there are three possible answers to this question:
1. ISO-32000 only allows you to define a rectangle as the clickable area of AcroForm fields. This is the area that is highlighted when you select highlight fields. You can define a border for this rectangle consisting of an array containing at least 3 values: the horizontal corner radius, the vertical corner radius and the border width. An optional fourth value allows you to define the dash pattern.
2. Apart from this you can create any appearance for a widget annotation that corresponds with an AcroForm field. The appearance is stored in the /AP entry of the annotation dictionary. This is quite common for button fields (see for instance the createAppearance() method in the Calculator example). This is not done for text fields, as the appearance will disappear the moment somebody changes the value of the text field.
3. Maybe you are asking to create a rectangular border that is part of the content stream of the page as opposed to a shape that is defined at the AcroField level (see for instance how Open Office adds border to form fields: these borders don't disappear when you remove the field dictionary).
If I had to guess, I'd say you're looking for answer 2 regarding buttons and for answer 3 regarding text fields.
Update: thank you for accepting the answer even though I misunderstood the question. You were asking for a field where you define a path (not necessarily a straight line) that will be used to position the value of the field (for instance: a word written in a way that the characters form a circle). That's not possible with AcroForm fields.