I'm trying to write a program that reads a docx file and checks whether some of the text is colored. For instance, imagine if all the words bolded in this sentence were actually written in some arbitrary color. I want my program to recognize that the words "words bolded in this sentence were actually written in some arbitrary color" are colored.
Then, after recognizing the coloration, I want to be able to edit the recognized text based on the color. For instance, if the bolded text above were red, I want to add "<Red>" tags around the text, while still keeping intact the rest of the sentence that isn't colored.
I was originally using ZipInputStream and ZipEntry to get the "word/document.xml," and I had planned on pulling the text and colors from there, but I feel like that would get too confusing after a while. I also tried using Apache POI, but I don't think it's able to recognize colors. Docx4j looks promising, though. Any thoughts, suggestions, or sample code to get me started?
Font color is a run property:
<w:r>
  <w:rPr>
    <w:color w:val="FF0000"/>
  </w:rPr>
  <w:t>red</w:t>
</w:r>
docx4j provides three ways to do stuff with that:
via XPath
via TraversalUtil
via XSLT
I'd recommend TraversalUtil, since XPath is dependent on JAXB's support for it, which isn't always robust (at least in the Sun/Oracle reference implementation).
See the finders package for examples of using this.
But beyond this, the challenge you face is that the color property could be specified via a style (or even as a document default). If you want to take this into account, you need to be looking at the effective run properties (which is what docx4j's PDF output does).
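To make the run-property idea concrete, here is a minimal sketch that uses only the JDK's DOM parser rather than docx4j. It assumes you already have the contents of word/document.xml as a string, and the class and method names (ColoredRuns, tagColoredRuns) are made up for illustration. It only sees color set directly on a run, so it does not resolve styles or document defaults, and it uses the raw hex value as the tag name (mapping FF0000 to "Red" is left out).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of docx4j or POI.
final class ColoredRuns {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates all run text, wrapping directly colored runs as <RRGGBB>text</RRGGBB>. */
    static String tagColoredRuns(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            NodeList runs = doc.getElementsByTagNameNS(W, "r");
            for (int i = 0; i < runs.getLength(); i++) {
                Element run = (Element) runs.item(i);
                String color = directColor(run);
                String text = textOf(run);
                if (color == null) {
                    out.append(text); // uncolored text is kept as-is
                } else {
                    out.append('<').append(color).append('>')
                       .append(text)
                       .append("</").append(color).append('>');
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    /** Returns the w:color value set directly on the run, or null if none (or "auto"). */
    static String directColor(Element run) {
        NodeList colors = run.getElementsByTagNameNS(W, "color");
        if (colors.getLength() == 0) return null;
        String val = ((Element) colors.item(0)).getAttributeNS(W, "val");
        return (val.isEmpty() || val.equals("auto")) ? null : val;
    }

    /** Joins the text of every w:t element inside the run. */
    static String textOf(Element run) {
        StringBuilder sb = new StringBuilder();
        NodeList texts = run.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < texts.getLength(); i++) sb.append(texts.item(i).getTextContent());
        return sb.toString();
    }
}
```

Once this works on your documents, the same logic should port fairly directly to a TraversalUtil callback that inspects each run.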
Related
I am using Apache POI's HSLFSlideShow and extracting the paragraphs individually using:
String paragraph = currentSlide.getTextParagraphs().get(i).toString();
However I am unable to figure out how to extract formatting information, such as bold, italic, text size and typeface.
I did come across the CharacterRun, which seems to have the features I need, however it does not seem to be applicable to HSLFSlideShow (unless I'm using it wrong).
My overall task can be found here.
I have been playing around with PDFBox and its PDFTextStripperByArea class.
I was able to extract information if the text is bold or italic, but I'm unable to get the underline information.
As far as I understand it, in PDF an underline is done by drawing lines. So in theory I should be able to get some sort of information about lines somewhere around the text. Given this information, I could then find out whether the text is underlined or in a table.
Here is my code so far:
List<TextPosition> textPos = charactersByArticle.get(index);
for (TextPosition t : textPos)
{
    if (t.getFont().getFontDescriptor() != null)
    {
        if (t.getFont().getFontDescriptor().getFontWeight() > BOLD_WEIGHT ||
            t.getFont().getFontDescriptor().isForceBold())
        {
            isBold = true;
        }
        if (t.getFont().getFontDescriptor().isItalic())
        {
            isItalic = true;
        }
    }
}
I have tried playing around with the PDGraphicsState object, which is processed in the processEncodedText method of the PDFStreamEngine class, but found no information about lines there.
Any suggestions as to where this information could be retrieved from?
Here is what I have found out so far:
PDFBox uses a resource file to bind PDF operators/instructions to certain classes which then process the information.
If we take a look at the PDFTextStripper.properties resource file under:
pdfbox\src\main\resources\org\apache\pdfbox\resources\
we can see that for instance the BT operator is bound to the
org.apache.pdfbox.util.operator.BeginText class and so on.
The PDFTextStripper under
pdfbox\src\main\java\org\apache\pdfbox\util\
takes this into account and uses these classes to process the PDF.
BUT all graphical objects are ignored, so there is no information about underlines or table structure!
Now if we take a look at the PageDrawer.properties resource file, we can see that it binds almost all available operators. It is used by the PageDrawer class under
pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\
The "trick" now is to find out which graphical operators represent underlines and tables, and to use them in combination with PDFTextStripper.
That would mean reading the PDF file specification, which is currently way too much work.
If someone knows which operators are responsible for drawing underlines and table lines, please let me know.
As you mention -- PDFBox uses resource files to bind PDF operators/instructions to visitors which will process the information.
You'd probably best start by copying PDFBox's existing visitor into your own source-folder, and then adding/ extending the implementation from there.
My long-ago PostScript experience recalls 'moveto' and 'lineto' operators. Since PDF is roughly PS-based, you'll be looking for something similar.
http://learnpostscript.wordpress.com/category/lineto/
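For what it's worth, the PDF counterparts of those PostScript operators are m (moveto), l (lineto) and re (rectangle), with S to stroke and f to fill the path. An underline is often drawn either as a stroked horizontal line or as a very thin filled rectangle. Here is a rough sketch of scanning a decoded content stream for both. PathScan and horizontalRules are hypothetical names, and the whitespace tokenizer is deliberately naive: it would be confused by spaces inside string operands and ignores the transformation matrix.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class PathScan {
    /** Returns {x1, x2, y} for each horizontal line segment or thin rectangle found. */
    static List<double[]> horizontalRules(String contentStream, double maxThickness) {
        List<double[]> rules = new ArrayList<>();
        Deque<Double> operands = new ArrayDeque<>();
        double curX = 0, curY = 0; // current point, set by "m" and "l"
        for (String tok : contentStream.trim().split("\\s+")) {
            if (isNumber(tok)) {
                operands.push(Double.parseDouble(tok)); // operands precede their operator
                continue;
            }
            if (tok.equals("m") && operands.size() >= 2) {          // x y m
                curY = operands.pop();
                curX = operands.pop();
            } else if (tok.equals("l") && operands.size() >= 2) {   // x y l
                double y = operands.pop();
                double x = operands.pop();
                if (y == curY) {
                    rules.add(new double[] {Math.min(curX, x), Math.max(curX, x), y});
                }
                curX = x;
                curY = y;
            } else if (tok.equals("re") && operands.size() >= 4) {  // x y w h re
                double h = operands.pop();
                double w = operands.pop();
                double y = operands.pop();
                double x = operands.pop();
                if (Math.abs(h) <= maxThickness) {
                    rules.add(new double[] {Math.min(x, x + w), Math.max(x, x + w), y});
                }
            }
            operands.clear(); // any operator consumes the pending operands
        }
        return rules;
    }

    static boolean isNumber(String tok) {
        return tok.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }
}
```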
PDF format is a b*tch -- it's HTML, done wrong. It represents graphical implementation, not semantics. Even reconstructing sentences is difficult -- words or even individual characters are positioned, and the 'space' or 'newline' must be algorithmically reconstructed. In short, Adobe are a*holes. And Reader is a non-ergonomic, bug-riddled, insecure, bloated pig.
However, you can accomplish your requirement -- if you are willing to put, say, 12+ hours of work in. As well as detecting by position, underlines will typically be emitted in the PDF immediately after their text, so you can latch your detection by PDF document-order, not just page position.
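The positional half of that check could look something like the sketch below. It assumes you have already collected the text's horizontal extent, baseline and font size (for example from TextPosition objects) and the coordinates of a candidate horizontal line, all in the same coordinate space with y growing upward as in default PDF user space. The class name and the 80% / quarter-font-size thresholds are guesses to be tuned, not values from any specification.

```java
// Hypothetical helper; thresholds are assumptions, not taken from the PDF spec.
final class UnderlineMatch {
    /**
     * Decides whether a horizontal line at height ruleY spanning ruleX1..ruleX2
     * plausibly underlines text spanning textX1..textX2 on the given baseline.
     */
    static boolean isUnderlined(double textX1, double textX2, double baselineY, double fontSize,
                                double ruleX1, double ruleX2, double ruleY) {
        // The line must run under most of the text horizontally.
        double overlap = Math.min(textX2, ruleX2) - Math.max(textX1, ruleX1);
        boolean coversText = overlap >= 0.8 * (textX2 - textX1);

        // The line must sit at or slightly below the baseline (y grows upward).
        double drop = baselineY - ruleY;
        boolean justBelow = drop >= 0 && drop <= 0.25 * fontSize;

        return coversText && justBelow;
    }
}
```

A line well above the baseline is more likely a strikethrough, and one far below is more likely a table border, which is why both bounds on the drop matter.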
Also, try constructing a trivial two-line PDF with underlined text. Then see what you can make of it, parsing it back in! The underline should stick out like dog's bananas, and once you can detect that, you'll be well on the way.
PDFBox is not very good for extensibility, it's mainly just a big pile of algorithms. For this reason, just copy the PDFTextStripper source (and maybe have PageDrawer for reference) and prototype from there.
Hope this helps!
You can use iText to generate PDF reports.
With iText you can draw lines easily.
Try the following:
document.add(new LineSeparator(0.5f, 50, null, 0, 198));
The code above draws a line in the PDF report; set the dimensions according to your needs.
Hope this helps.
As far as I have understood the pdfbox, there is no option by which you can read underline. Maybe you can try itextpdf for this purpose.
According to the API, getfont() returns the font size.
You can use getStyle() method and it will return STYLE_UNDERLINE for underlined font. Thus you can retrieve underline style.
I have a library which generates pdf document with images.
I want to be able to add text after each image. What is the syntax for that? How do I insert text into PDF documents?
I have to use the library I have, not another one.
First of all, mkl is correct, have a look at the specification for all of the details. PDF is an exact language, if you make mistakes they will routinely be punished severely once you open the PDF in viewers.
Secondly, when you think about putting text on the page, don't forget that besides the text operators to draw the text on the page, you'll also have to specify the font to use to draw this text. Which will include making sure there is a font resource included in the PDF file if your library doesn't automatically handle all of that for you.
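As a rough illustration of the text operators involved: a text object is opened with BT and closed with ET, Tf selects a font resource and size, Td moves to a position, and Tj shows a string. The sketch below just builds that fragment of page content as a string. The class name is made up, and the font name passed in (F1 in the usage below) is an assumption: it has to match a font that is actually declared in the page's resources. It also only covers simple single-byte text, since anything beyond that depends on the font's encoding.

```java
import java.util.Locale;

// Hypothetical helper that emits a text-showing fragment for a page content stream.
final class PdfTextOps {
    /** Escapes the characters that are special inside a PDF literal string. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    /** Formats a number with a '.' decimal separator regardless of the default locale. */
    static String num(double v) {
        return String.format(Locale.ROOT, "%.2f", v);
    }

    /** Builds: BT, select font and size, move to (x, y), show the text, ET. */
    static String showText(String fontResourceName, double size, double x, double y, String text) {
        return "BT\n"
                + "/" + fontResourceName + " " + num(size) + " Tf\n"
                + num(x) + " " + num(y) + " Td\n"
                + "(" + escape(text) + ") Tj\n"
                + "ET\n";
    }
}
```

For example, PdfTextOps.showText("F1", 12, 72, 700, "Figure 1") would place a caption 72 points from the left and 700 points from the bottom of the page, assuming the default coordinate system is still in effect at that point in the stream.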
If you want to cut corners (I shiver while writing this) and perhaps not read the specification as thoroughly, try this.
1) Create a PDF file that looks more or less like what you want.
2) Use a tool such as pdfToolbox from callas (http://www.callassoftware.com/callas/doku.php/en:download) or Browser from Enfocus (http://www.enfocus.com/en/products/browser). Both of these tools allow you to investigate the low-level structure of a PDF file, including looking at the actual page description code. This will show you how fonts are embedded (if you have to do it yourself that could be very handy) and how text is rendered on the page (and how you set the font, size, color etc... to use).
I have a Word template. It contains the word "photo", which has to be replaced with an image. This has to be done with Docx4Java.
How do I do this?
If you specifically want to replace text with an image (which is not possible using docx4j, as answered above), you can replace a bookmark with an image as an alternative.
Just open your Word template, position the cursor at the desired location, choose Insert -> Bookmark, and name your bookmark.
I followed the instructions here to replace this bookmark with an image.
Disclosure: I manage the docx4j project
The VariableReplace code doesn't handle images.
The best way to do this would be to use data bound content controls, specifically a picture content control pointing via XPath at a base-64 encoded image in an XML document (see Getting Started for details).
However, if you want to replace a word with an image, you can do so, but you'll have to write a bit of glue code. It is pretty straightforward.
First, find the word. You can do this using XPath or TraversalUtil (again, see Getting Started for details).
Hopefully it is in a run (w:r/w:t) by itself. If not, you'll need to split the run up so you don't replace adjacent text.
Then, add the image. See the sample ImageAdd.
I suggest you have a look at the XML created when you add an image in Word (ie save and unzip your docx, then look at document.xml). Take care that the XML representing the image is at the correct level (eg child of w:p).
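The run-splitting step mentioned above is mostly string handling. Here is a library-free sketch of it, with RunSplit and splitAround as made-up names: it takes the text of one run and returns the pieces before, at, and after the word. In real glue code each piece would become its own w:r carrying a copy of the original run properties, and the middle one would then be swapped for the image; that part depends on docx4j and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; operates on plain strings, not on docx4j objects.
final class RunSplit {
    /**
     * Splits run text around the first occurrence of word.
     * Returns the non-empty pieces in order, or the original text alone if the word is absent.
     */
    static List<String> splitAround(String runText, String word) {
        List<String> parts = new ArrayList<>();
        int at = word.isEmpty() ? -1 : runText.indexOf(word);
        if (at < 0) {
            parts.add(runText); // nothing to replace in this run
            return parts;
        }
        if (at > 0) parts.add(runText.substring(0, at)); // text before the word
        parts.add(word);                                 // the piece to swap for the image
        int end = at + word.length();
        if (end < runText.length()) parts.add(runText.substring(end)); // text after the word
        return parts;
    }
}
```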
In my Java application I want to output struck-through letters (like the HTML <strike> tag does). Is there any way to do this using Unicode (combining characters)?
You can use U+0336, the combining long stroke overlay, to accomplish this task.
The official Combining Diacritical Marks Unicode chart lists "strikethrough" as an 'informative alias', meaning that this is the officially specified purpose of this character.
0336 ̶◌ COMBINING LONG STROKE OVERLAY
= strikethrough
• connects on left and right
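In Java, applying it comes down to appending U+0336 after every character. A minimal sketch (Strike and strike are just illustrative names); it walks code points rather than chars so that characters outside the Basic Multilingual Plane are not split in half:

```java
// Hypothetical helper for producing "struck-through" text with a combining character.
final class Strike {
    // COMBINING LONG STROKE OVERLAY
    static final char OVERLAY = '\u0336';

    /** Returns the input with the overlay appended after each code point. */
    static String strike(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        s.codePoints().forEach(cp -> sb.appendCodePoint(cp).append(OVERLAY));
        return sb.toString();
    }
}
```

Whether the result actually looks struck through still depends on the font and renderer that display it.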
For comparison, here is U+0336 compared to the html <strike> tag:
U̶n̶i̶c̶o̶d̶e ̶c̶o̶m̶b̶i̶n̶i̶n̶g̶ ̶l̶o̶n̶g̶ ̶s̶t̶r̶o̶k̶e̶ ̶o̶v̶e̶r̶l̶a̶y̶
Hypertext strike tag
Note that many font rendering engines do not render U+0336 correctly, and when possible one should use markup formatting or another mechanism. Depending on your browser, the above text likely has a large gap in the line around the "m" in combining, and @Alex78191 reports that it renders so low for them that it looks more like an underline than a strikethrough.
For this reason, one should still prefer HTML or another markup technology over U+0336 for this purpose, given the option.
No, this is not possible. While there is the concept of a stroke as diacritic, it's not available as a separate Unicode character, probably because the various letters that use a stroke diacritic do not place it at the same height or even angle. So the result would not resemble strikethrough markup anyway.
To output strikethrough text in Java, you need to use an output format that allows you to use explicit markup. If you have a Swing app, you're in luck as many Swing components support HTML. Otherwise it depends on what presentation technology you're using.
As said before, Unicode doesn't do that, but a lot of Swing components understand basic HTML tags.
JLabel label = new JLabel("<html><s>My stroke</s></html>");
No. Unicode does not define a combining strikeout mark. Unicode's view is that this is the job of markup -- like HTML.