I have been playing around with PdfBox and PDFTextStripperByArea method.
I was able to extract information if the text is bold or italic, but I'm unable to get the underline information.
As far as I understand it in PDF, underline is done by drawing lines. So in theory I should be able to get some sort of information about lines somewhere around the text. Giving this information I could then find out if either text is underlined or in a table.
Here is my code so far:
List<TextPosition> textPos = charactersByArticle.get(index);
for (TextPosition t : textPos)
{
if (t.getFont().getFontDescriptor() != null)
{
if (t.getFont().getFontDescriptor().getFontWeight() > BOLD_WEIGHT ||
t.getFont().getFontDescriptor().isForceBold())
{
isBold = true;
}
if (t.getFont().getFontDescriptor().isItalic())
{
isItalic = true;
}
}
}
I have tried to play around the PDGraphicsState object which is processed in the processEncodedText method in PDFStreamEngine class but no information of lines found there.
Any suggestions where this information could be retrieved from ?
Here is what I have found out so far:
PDFBox uses a resource file to bound PDF operators/instructions to certain classes which then process the information.
If we take a look at the PDFTextStripper.properties resource file under:
pdfbox\src\main\resources\org\apache\pdfbox\resources\
we can see that for instance the BT operator is bound to the
org.apache.pdfbox.util.operator.BeginText class and so on.
The PDFTextStripper under
pdfbox\src\main\java\org\apache\pdfbox\util\
takes this into account and utilizes the processing of the PDF with this classes.
BUT all graphical objects are ignored, therefore no information of underline or table structure!
Now if we take a look at the PageDrawer.properties resource file we can see that this one bounds to almost all operators available. Which is utilized by PageDrawer class under
pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\
The "trick" is now to find out which graphical operators are those who represent underline and tables and to use them in combination with PDFTextStripper.
Now this would mean reading the PDF file specification, which is currently way to much work.
If someone knows which operators are responsible for which actions to draw underlines and table lines please let me know.
As you mention -- PDFBox uses resource files, to bind PDF operators/ instructions to visitors which will process the information.
You'd probably best start by copying PDFBox's existing visitor into your own source-folder, and then adding/ extending the implementation from there.
My long-ago PostScript experience recalls 'moveto' and 'lineto' operators. Since PDF is roughly PS-based, you'll be looking for something similar.
http://learnpostscript.wordpress.com/category/lineto/
PDF format is a b*tch -- it's HTML, done wrong. It represents graphical implementation, not semantics. Even reconstructing sentences is difficult -- words or even individual characters are positioned, the 'space' or 'newline' must be algorithmically reconstructed. In short, Adobe are a*holes. And Reader is an non-ergonomic, bug-riddled, insecure, bloated pig.
However, you can accomplish your requirement -- if you are willing to put, say, 12+ hours of work in. As well as detecting by position, underlines will typically be emitted in the PDF immediately after their text.. so you can latch your detection by PDF document-order, not just page position.
Also, try constructing a trivial two-line PDF with underlined text. Then see what you can make of it, parsing it back in! The underline should stick out like dog's bananas, and once you can detect that, you'll be well on the way.
PDFBox is not very good for extensibility, it's mainly just a big pile of algorithms. For this reason, just copy the PDFTextStripper source (and maybe have PageDrawer for reference) and prototype from there.
Hope this helps!
you can use Itext to generate pdf reports.
by using itext you can able to put the lines in easy way.
try the follwing.
document.add(new LineSeparator(0.5f, 50, null, 0, 198));
the above code is used to generate lines in pdf report. and set the dimensions according to your choice.
hope this will help you.
As far as I have understood the pdfbox, there is no option by which you can read underline. Maybe you can try itextpdf for this purpose.
According to the api getfont() returns The font size.
You can use getStyle() method and it will return STYLE_UNDERLINE for underlined font. Thus you can retrieve underline style.
Related
I have a text editor in which the user can define a pattern for data. A pattern can contain a reference to another pattern. A simple example would be { name: $pattern2 }, where $pattern2 refers to a different pattern which will later be substituted into this one. However, patterns are identified by a UUID, not by a name, which makes it unwieldy to use. To compensate, users have a bunch of buttons that insert the UUID into the editor for them. However, UUIDs are not nice to look at.
A visually appealing way might be possible using a JTextPane (or JEditorPane) to replaces the UUID with a small square containing the referenced pattern's name. The first image below shows what the actual contents of the Document are, while the second image shows how I imagine it is displayed to the user, given two other patterns with the names GoodName and Date.
I have tried using the javax.swing.text.Style object obtained from jTextPane.addStyle("", null), I tried with a javax.swing.text.StyleContext, and I looked at extending javax.swing.text.DefaultStyledDocument, but those mostly seem to concern themselves with changing the way the font is rendered. I don't really see a way to render specific text as a non-text shape. Furthermore, I don't see a way of making the reference "atomic", in the sense that the user cannot select and remove half of the UUID, which would reveal the underlying data which I'm trying to keep hidden.
Is anyone aware of a component that supports behaviour like this? Am I overlooking a Swing feature here? Swing has loads of documentation but it's hard to find what you're looking for if you don't know what it's called.
Perhaps this is a case where a JLayer can be successfully applied.
Instead of replacing the UUIDs with small rectangles, an equally large rectangle containing the name could be drawn over each UUID.
My program injects text and pictures into a Word template. This works great via content control data binding (thanks to docx4j and Content-Control-Toolkit).
My problem is, that images get resized after injection. What I actually want, is the behavoir that Jason decribed here: http://www.docx4java.org/forums/data-binding-java-f16/picture-content-control-size-t634.html
The current behaviour is to just let it be whatever its natural size is (at a given dpi), unless that is greater than page width, in which case it is scaled down.
According to that post, the behavoir of docx4j has been changed so that the pictures always fit the size of the content control with respect to the ratio.
Is it possible to get the "old" behavoir back? Do I have to do that on my own, or is the switch, that Jason wrote about, already implemented?
As the answer to How to force Docx4j to refresh a replaced image file states, the size of a picture is stored in the main document part. At the moment, I only use XPath to set content in the custom XML part. If there is any possibility to get what I need without touching the documents XML directly, I would really prefer that. A macro to set the size after opening the document in Word is no option for me.
The first thing to be aware of is that these days we prefer to have a picture in a rich text content control, as opposed to a picture content control.
This is because Word limits your ability to "float" a picture content control.
The handling for this is triggered by w:tag containing 'od:Handler=picture': datastorage/bind.xslt#L165
The basic behaviour is that if the w:sdtContent contains an existing w:drawing/wp:inline/a:graphic then reuse it, so any formatting thus configured is used.
But for a "legacy" picture content control which doesn't contain a:blip (when would this be?), xpathInjectImage is invoked with wp:extent passed in (see bind.xslt#L240).
At line 1143, if (cxl==0 || cyl==0) // Let BPAI work out size
So if you want the image at its natural size, you could try removing the when clause at bind.xslt#L212
By the way, we can also bind escaped XHTML. But there, we make an effort to fit any image not just to the page width, but if in a table cell, to that as well.
I was going through the itext api docs & I was able create a pdf with a watermark image or text but did not find a method to get/extract watermark content from pdf.
So I have a pdf document containing watermarked text/image & I want to extract that text or img and validate which I am not able to do.
How to extract watermark content using iText apis? Or is there any other way to validate watermark content?
By validate I mean if I have an existing pdf/image with some watermarked text [as done in 2nd link in above ref], I want to check whether it has expected text/image.
References:
http://itextpdf.com/themes/keyword.php?id=226
http://www.java-connect.com/itext/add-watermark-in-PDF-document-using-java-iText-library.html
How to extract watermark content using iText apis? Or is there any other way to validate watermark content?
Extracting watermark content?
There is nothing special about watermarks in PDFs in contrast to regular page content. They merely
appear pretty early in the content stream and other content later in the stream, therefore, is drawn above it; or they
appear pretty late in the content stream but have some kind of transparency applied.
Actually there is another type of watermarks which is special, the so-called Watermark Annotations. As these annotation can easily be lost when documents are merged or otherwise manipulated, though, they hardly ever are used.
Furthermore different PDF generating software suites offering a way to add watermarks do so in their respective individual way. Thus, you cannot even recognize watermarks by some special operations done in some specific unique pattern.
Already the iText examples you referred to apply different kinds of watermarks
MovieCountries2 simply draws some gray large Text using an angled base line.
StampStationery copies a complete page from some PDF (which itself may visually have foreground and background material) into a separate object inside the target PDF and adds a reference to this object at the beginning of every page of the target.
InsertPages similarly references a page from some PDF on every newly generated target document page.
Thus, blind watermark extraction is virtually impossible.
Validating watermark content!
You might try some validation, though, if you know what you are searching for. You simply do not merely search some (in PDF not existing) fixed watermark stream but instead the whole page content.
iText offers the classes of the parser package which allow extraction of text and/or bitmap images from content streams. Look at the samples referenced from the keywords PARSING PDF > EXTRACTING IMAGES and PARSING PDF > EXTRACTING TEXT.
You merely have to check whether the image or text which you expect can be found by these classes positioned and styled as you expect.
I'm trying to write a program that reads a docx file and checks whether some of the text is colored. For instance, imagine if all the words bolded in this sentence were actually written in some arbitrary color. I want my program to recognize that the words "words bolded in this sentence were actually written in some arbitrary color" are colored.
Then after recognizing the coloration, I want to be able to edit the recognized text based on the color. For instance, if the the bolded text above were red, I want to add "Red>" tags around the text, while still keeping intact the rest of the sentence that isn't colored.
I was originally using ZipInputStream and ZipEntry to get the "word/document.xml," and I had planned on pulling the text and colors from there, but I feel like that would get too confusing after a while. I also tried using Apache poi, but I don't think it's able to recognize colors. Docx4j looks promising, though. Any thoughts, suggestions, or sample code to get me started?
Font color is a run property:
<w:r>
<w:rPr>
<w:color w:val="FF0000"/>
</w:rPr>
<w:t>red</w:t>
</w:r>
docx4j provides three ways to do stuff with that:
via XPath
via TraversalUtil
via XSLT
I'd recommend TraversalUtil, since XPath is dependent on JAXB's support for it, which isn't always robust (at least in the Sun/Oracle reference implementation).
See the finders package for examples of using this.
But beyond this, the challenge you face is that the color property could be specified via a style (or even as a document default). If you want to take this into account, you need to be looking at the effective run properties (which is what docx4j's PDF output does).
I have a library which generates pdf document with images.
I want to be able to add text after each image. What is the syntax for that? How to insert text into pdf documents?
I have to use the library I have, not another one.
First of all, mkl is correct, have a look at the specification for all of the details. PDF is an exact language, if you make mistakes they will routinely be punished severely once you open the PDF in viewers.
Secondly, when you think about putting text on the page, don't forget that besides the text operators to draw the text on the page, you'll also have to specify the font to use to draw this text. Which will include making sure there is a font resource included in the PDF file if your library doesn't automatically handle all of that for you.
If you want to cut corners (I shiver while writing this) and perhaps don't read the specification as thoroughly, try this.
1) Create a PDF file that looks more or less like what you want.
2) Use a tool such as pdfToolbox from callas (http://www.callassoftware.com/callas/doku.php/en:download) or Browser from Enfocus (http://www.enfocus.com/en/products/browser). Both of these tools allow you to investigate the low-level structure of a PDF file, including looking at the actual page description code. This will show you how fonts are embedded (if you have to do it yourself that could be very handy) and how text is rendered on the page (and how you set the font, size, color etc... to use).