I am using the Apache PDFBox library to extract the highlighted text (i.e., text with a yellow background) from a PDF file. I am totally new to this library and don't know which of its classes to use for this purpose.
So far I have extracted the text from comments using the code below.
PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
int pageNum = i + 1;
PDPage page = (PDPage) allPages.get(i);
List<PDAnnotation> la = page.getAnnotations();
if (la.size() < 1) {
continue;
}
System.out.println("Total annotations = " + la.size());
System.out.println("\nProcess Page " + pageNum + "...");
// Just get the first annotation for testing
PDAnnotation pdfAnnot = la.get(0);
System.out.println("Getting text from comment = " + pdfAnnot.getContents());
}
Now I need to get the highlighted text, any code example will be highly appreciated.
I hope this answer helps everyone who is facing the same problem.
// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
ArrayList<String> highlightedTexts = new ArrayList<>();
// this is the in-memory representation of the PDF document.
// this will load a document from a file.
PDDocument document = PDDocument.load(filePath);
// this represents all pages in a PDF document.
List<PDPage> allPages = document.getDocumentCatalog().getAllPages();
// this represents a single page in a PDF document.
PDPage page = allPages.get(pageNumber);
// get annotation dictionaries
List<PDAnnotation> annotations = page.getAnnotations();
for(int i=0; i<annotations.size(); i++) {
// check subType
if(annotations.get(i).getSubtype().equals("Highlight")) {
// extract highlighted text
PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
String str = null;
for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {
COSFloat ULX = (COSFloat) quadsArray.get(0+k);
COSFloat ULY = (COSFloat) quadsArray.get(1+k);
COSFloat URX = (COSFloat) quadsArray.get(2+k);
COSFloat URY = (COSFloat) quadsArray.get(3+k);
COSFloat LLX = (COSFloat) quadsArray.get(4+k);
COSFloat LLY = (COSFloat) quadsArray.get(5+k);
COSFloat LRX = (COSFloat) quadsArray.get(6+k);
COSFloat LRY = (COSFloat) quadsArray.get(7+k);
k+=8;
float ulx = ULX.floatValue() - 1; // upper left x.
float uly = ULY.floatValue(); // upper left y.
float width = URX.floatValue() - LLX.floatValue(); // calculated by upperRightX - lowerLeftX.
float height = URY.floatValue() - LLY.floatValue(); // calculated by upperRightY - lowerLeftY.
PDRectangle pageSize = page.getMediaBox();
uly = pageSize.getHeight() - uly;
Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
stripperByArea.addRegion("highlightedRegion", rectangle_2);
stripperByArea.extractRegions(page);
String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");
if(j > 1) {
str = str.concat(highlightedText);
} else {
str = highlightedText;
}
}
highlightedTexts.add(str);
}
}
document.close();
return highlightedTexts;
}
The code in the question "Not able to read the exact text highlighted across the lines" already illustrates most of the concepts needed for extracting text from limited content regions on a page with PDFBox.
Having studied this code, the OP still wondered in a comment:
But one thing I am confused about is QuadPoints instead of Rect. as you mentioned there in comment. What are this, can you explain it with some code lines or in simple words, as I am also facing the same problem of multi lines highlghts?
In general the area an annotation refers to is a rectangle:
Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.
(from Table 164 – Entries common to all annotation dictionaries - in ISO 32000-1)
For some annotations types (e.g. text markups), this location value does not suffice because:
text to markup may be written at some odd angle but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
text to markup may start anywhere in a line and end anywhere in another one, so the markup area is not rectangular at all but it is the union of multiple rectangular parts.
To cope with such annotation types, therefore, the PDF specification provides a more generic way to define areas:
QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompasses a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order
x1 y1 x2 y2 x3 y3 x4 y4
specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x1, y1) and (x2, y2).
(from Table 179 – Additional entries specific to text markup annotations - in ISO 32000-1)
Thus, instead of the rectangle given by
PDRectangle rect = pdfAnnot.getRectangle();
in the code in the referenced question, you have to consider the quadrilaterals given by
COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
and define regions for the PDFTextStripperByArea stripper accordingly. Unfortunately PDFTextStripperByArea.addRegion expects a rectangle as parameter, not some generic quadrilateral. As text usually is printed horizontally or vertically, that should not pose too big a problem.
PS: One warning concerning the specification of the QuadPoints: the order of the vertices may differ in real-life PDFs, cf. the question "PDF Spec vs Acrobat creation (QuadPoints)".
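As a rough illustration of turning quadrilaterals into regions for PDFTextStripperByArea.addRegion, the following sketch uses only java.awt.geom. The class and method names (QuadRegions, toRegion, toRegions) are made up for this example. It assumes the page is not rotated and that its lower left corner is at the origin, and it takes the bounding box of each quad, so it is only adequate for horizontally or vertically printed text. Because it uses the minimum and maximum of the four vertices, the vertex order mentioned in the PS does not matter.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class QuadRegions {
    /**
     * Bounding box of one quadrilateral (8 numbers starting at offset, PDF user
     * space with y pointing up), converted to top-left-origin coordinates.
     * Assumption: unrotated page whose lower left corner is at (0, 0).
     */
    static Rectangle2D.Float toRegion(float[] quadPoints, int offset, float pageHeight) {
        float minX = Float.POSITIVE_INFINITY, maxX = Float.NEGATIVE_INFINITY;
        float minY = Float.POSITIVE_INFINITY, maxY = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < 4; i++) {
            float x = quadPoints[offset + 2 * i];
            float y = quadPoints[offset + 2 * i + 1];
            minX = Math.min(minX, x);
            maxX = Math.max(maxX, x);
            minY = Math.min(minY, y);
            maxY = Math.max(maxY, y);
        }
        // Flip the y axis: the top edge (maxY) becomes the region's y coordinate.
        return new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY);
    }

    /** One region per complete group of 8 numbers in the QuadPoints array. */
    static List<Rectangle2D.Float> toRegions(float[] quadPoints, float pageHeight) {
        List<Rectangle2D.Float> regions = new ArrayList<>();
        for (int offset = 0; offset + 8 <= quadPoints.length; offset += 8) {
            regions.add(toRegion(quadPoints, offset, pageHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        float[] quads = {100, 700, 200, 700, 100, 680, 200, 680};
        System.out.println(toRegions(quads, 792f));
    }
}
```

Each resulting rectangle could then be registered under its own region name, so that the text of a multi-line highlight can be collected line by line.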
I try to extract some text out of a PDF. For that I need to define a rectangle that contains the text.
I noticed that coordinates seem to have a different meaning when I compare the coordinates used for text extraction with the coordinates used for drawing.
package MyTest.MyTest;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.PDPageContentStream.*;
import org.apache.pdfbox.text.*;
import java.awt.*;
import java.io.*;
public class MyTest
{
public static void main (String [] args) throws Exception
{
PDDocument pd = PDDocument.load (new File ("my.pdf"));
PDFTextStripperByArea st = new PDFTextStripperByArea ();
PDPage pg = pd.getPage (0);
float h = pg.getMediaBox ().getHeight ();
float w = pg.getMediaBox ().getWidth ();
System.out.println (h + " x " + w + " in internal units");
h = h / 72 * 2.54f * 10;
w = w / 72 * 2.54f * 10;
System.out.println (h + " x " + w + " in mm");
int X = 85;
int Y = 175;
int dX = 250;
int dY = 15;
// extract some text
st.addRegion ("a", new Rectangle (X, Y, dX, dY));
st.extractRegions (pg);
String text = st.getTextForRegion ("a");
System.out.println("text="+text);
// fill a rectangle
PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);
contents.setNonStrokingColor (Color.RED);
contents.addRect (X, Y, dX, dY);
contents.fill ();
contents.close ();
pd.save ("x.pdf");
}
}
The text I extract (output of text= in the console) is not the text I overdraw with my red rectangle (generated x.pdf).
Why??
For testing, try some PDF you already have. To avoid a lot of trial and error in aiming for a rectangle with text in it, use a file with a lot of text.
There are (at least) two issues in your approach:
Different coordinate systems
You use st.addRegion. Its JavaDoc comment tells us:
/**
* Add a new region to group text by.
*
* @param regionName The name of the region.
* @param rect The rectangle area to retrieve the text from. The y-coordinates are java
* coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
*/
public void addRegion( String regionName, Rectangle2D rect )
(Actually the whole text extraction apparatus of PDFBox uses its own coordinate system, and there have already been many questions on Stack Overflow because of the confusion this caused.)
On the other hand contents.addRect does not use those "java coordinates". Thus, you have to subtract the y coordinate you use in text extraction from the maximum crop box y coordinate to get a coordinate for addRect.
Furthermore, the region rectangles have their anchor point at the top left while the regular PDF rectangles (like the one you define with contents.addRect) have it at the bottom left. Thus, you additionally have to add or subtract the rectangle height from the y coordinate.
Actually you may have to change the x coordinate, too. It is not mirrored but there may be a shift, the PDFBox text extraction coordinate system uses x=0 for the left page border but that is not necessarily the case in PDF user space. Thus, you may have to add the left border x coordinate of the crop box to your text extraction x coordinate.
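The three adjustments described above (flip y, move the anchor from the top left to the bottom left, shift x by the left crop box border) can be sketched as follows. This is only an illustration under the assumption that the page is not rotated; the class and method names are made up, and cropLeft and cropTop stand for the crop box's lower left x and upper right y values, which with PDFBox could be read from pg.getCropBox().

```java
import java.awt.geom.Rectangle2D;

class RegionToUserSpace {
    /**
     * Converts a text extraction region (origin top left, y pointing down)
     * into a rectangle in PDF user space (anchor bottom left, y pointing up).
     * Assumption: unrotated page; cropLeft/cropTop come from the crop box.
     */
    static Rectangle2D.Float toUserSpace(Rectangle2D region, float cropLeft, float cropTop) {
        float width = (float) region.getWidth();
        float height = (float) region.getHeight();
        // x is not mirrored, only shifted by the left border of the crop box.
        float x = (float) region.getX() + cropLeft;
        // y is mirrored, and the anchor moves down by the rectangle height.
        float y = cropTop - (float) region.getY() - height;
        return new Rectangle2D.Float(x, y, width, height);
    }

    public static void main(String[] args) {
        // The region from the question on a page whose crop box is [0 0 595 842].
        Rectangle2D.Float region = new Rectangle2D.Float(85, 175, 250, 15);
        System.out.println(toUserSpace(region, 0f, 842f));
    }
}
```

The x, y, width and height of the result are the values one would pass to contents.addRect to cover the same area the text was extracted from.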
Possibly changed coordinate system
In the page content stream the coordinate system may have been changed by applying a transformation to the current transformation matrix. As a result the coordinates in the instructions you append to it may have a different meaning than even outlined above.
To rule out such an effect, you should use a different PDPageContentStream constructor with an additional boolean resetContext parameter:
/**
* Create a new PDPage content stream.
*
* @param document The document the page is part of.
* @param sourcePage The page to write the contents to.
* @param appendContent Indicates whether content will be overwritten, appended or prepended.
* @param compress Tell if the content stream should compress the page contents.
* @param resetContext Tell if the graphic context should be reset. This is only relevant when
* the appendContent parameter is set to {@link AppendMode#APPEND}. You should use this when
* appending to an existing stream, because the existing stream may have changed graphic
* properties (e.g. scaling, rotation).
* @throws IOException If there is an error writing to the page contents.
*/
public PDPageContentStream(PDDocument document, PDPage sourcePage, AppendMode appendContent,
boolean compress, boolean resetContext) throws IOException
I.e. replace
PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);
by
PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false, true);
As seen above, I used pdfTable as the body part of the page. Now, if the table does not fill the entire body part, I want to extend the borders, columns and rows of the table so that they fill it, as shown in the picture below.
Thanks very much!
You can sort of cheat to accomplish this if you do it while you are creating the table.
I'll offer a partial solution that takes advantage of one of the complexities of tables in PDFs: tables are just lines in PDFs. They aren't structured content.
You can take advantage of this, though: keep track of where you are drawing vertical lines while rendering the table and simply continue them to the bottom of the page.
Let's create a new cell event. It keeps track of 4 things: left, which is the far left x coordinate of the table; right, which is the far right x coordinate of the table; xCoordinates, which is a set of all the x coordinates at which we draw vertical lines; and finally cellHeights, which is a set of all the cell heights.
class CellMarginEvent implements PdfPCellEvent {
Set<Float> xCoordinates = new HashSet<Float>();
Set<Float> cellHeights = new HashSet<Float>();
Float left = Float.MAX_VALUE;
// Float.MIN_VALUE is the smallest positive float, so start from the most negative value instead.
Float right = -Float.MAX_VALUE;
public void cellLayout(PdfPCell pdfPCell, Rectangle rectangle, PdfContentByte[] pdfContentBytes) {
this.xCoordinates.add(rectangle.getLeft());
this.xCoordinates.add(rectangle.getRight());
this.cellHeights.add(rectangle.getHeight());
left = Math.min(left,rectangle.getLeft());
right = Math.max(right, rectangle.getRight());
}
public Set<Float> getxCoordinates() {
return xCoordinates;
}
}
We'll then add all of our cells to the table, but not add the table to the document just yet
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUT_FILE));
document.open();
PdfPTable table = new PdfPTable(4);
CellMarginEvent cellMarginEvent = new CellMarginEvent();
for (int aw = 0; aw < 320; aw++) {
PdfPCell cell = new PdfPCell();
cell.addElement(new Paragraph("Cell: " + aw));
cell.setCellEvent(cellMarginEvent);
table.addCell(cell);
}
Now we get top, the top position of our table, and add the table to the document.
float top = writer.getVerticalPosition(false);
document.add(table);
Then we draw the vertical and horizontal lines of the completed table. For the height of each cell I just used the first element in cellHeights.
Set<Float> xCoordinates = cellMarginEvent.getxCoordinates();
//Draw the column lines
PdfContentByte canvas = writer.getDirectContent();
for (Float x : xCoordinates) {
canvas.moveTo(x, top);
canvas.lineTo(x, 0 + document.bottomMargin());
canvas.closePathStroke();
}
Set<Float> cellHeights = cellMarginEvent.cellHeights;
Float cellHeight = (Float)cellHeights.toArray()[0];
float currentPosition = writer.getVerticalPosition(false);
//Draw the row lines
while (currentPosition >= document.bottomMargin()) {
canvas.moveTo(cellMarginEvent.left,currentPosition);
canvas.lineTo(cellMarginEvent.right,currentPosition);
canvas.closePathStroke();
currentPosition -= cellHeight;
}
And finally close the document:
document.close();
Example output:
Note that the only reason I say this is an incomplete example is that there may be some adjustments you need to make to top in the case of header cells, or there may be custom cell styling (background color, line color, etc.) you need to account for yourself.
I'll also note another downside that I just thought of: in the case of tagged PDFs this solution fails to add tagged table cells, and thus would break compliance if you have that requirement.
I am trying to add tiled diagonal watermarks to the pdf, but it seems that pattern fills in iText are always tiled from the bottom left of the page, meaning that the tiles at the top and right side of the page can be cut abruptly. Is there an option to tile from the top left or with an offset instead?
Here is a sample of the code:
List<String> watermarkLines = getWatermarkLines();
Rectangle watermarkRect = getWatermarkRect();
PdfContentByte over = stamper.getOverContent(1);
PdfPatternPainter painter = over.createPattern(watermarkRect.getWidth(), watermarkRect.getHeight());
for (int x = 0; x < watermarkLines.size(); x++) {
AffineTransform trans = getWatermarkTransform(watermarkLines, x);
ColumnText.showTextAligned(painter, 0, watermarkLines.get(x), (float) trans.getTranslateX(), (float) trans.getTranslateY(), 45f);
}
over.setColorFill(new PatternColor(painter));
over.rectangle(0, 0, pageSize.getWidth(), pageSize.getHeight());
over.fill();
I tried changing the x and y of the rectangle function to negative or positive values, but it seems that the watermark is still stamped in the pattern as if it was tiled from the bottom left, cutting it in the same place as before.
First off, I cannot fathom which iText version you are using,
List<String> watermarkLines = getWatermarkLines();
...
ColumnText.showTextAligned(painter, 0, watermarkLines.get(x), (float) trans.getTranslateX(), (float) trans.getTranslateY(), 45f);
implies that the third parameter of the ColumnText.showTextAligned method you use is typed as String or Object. The iText 5 version I have at hand, though, requires a Phrase there. Below I'll show how to apply an offset with the current iText 5.5.13. You'll have to check whether it also works for your version.
Yes, you can apply an offset... in the pattern definition!
If instead of
PdfPatternPainter painter = over.createPattern(watermarkRect.getWidth(), watermarkRect.getHeight());
you create the pattern like this
PdfPatternPainter painter = over.createPattern(2 * watermarkRect.getWidth(), 2 * watermarkRect.getHeight(),
watermarkRect.getWidth(), watermarkRect.getHeight());
you have the same step size of pattern application (watermarkRect.getWidth(), watermarkRect.getHeight()) but a canvas twice that width and twice that height to position your text on. By positioning the text with an offset, you effectively move the whole pattern by that offset.
E.g. if you calculate the offsets as
Rectangle pageSize = pdfReader.getCropBox(1);
float xOff = pageSize.getLeft();
float yOff = pageSize.getBottom() + ((int)pageSize.getHeight()) % ((int)watermarkRect.getHeight());
and draw the text using
ColumnText.showTextAligned(painter, 0, new Phrase(watermarkLines.get(x)), (float) trans.getTranslateX() + xOff, (float) trans.getTranslateY() + yOff, 45f);
the pattern should fill the page as if starting at the top left corner of the visible page.
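To see why the remainder is the right vertical offset, here is the arithmetic from the yOff line as a small stand-alone sketch. The class and method names are made up for illustration, and it assumes a tile height of at least 1 unit (the integer cast would otherwise lead to a division by zero).

```java
class PatternOffset {
    /**
     * Vertical shift that makes a tile boundary coincide with the top edge of
     * the crop box. Tiling starts at y = 0, so the topmost tile is cut after
     * (height % tileHeight) units; shifting everything up by that remainder
     * moves the cut from the top edge to the bottom edge.
     * Assumption: tileHeight >= 1.
     */
    static float yOffset(float cropBottom, float cropHeight, float tileHeight) {
        return cropBottom + ((int) cropHeight) % ((int) tileHeight);
    }

    public static void main(String[] args) {
        // A4 page height 842 with 50 unit tiles: 842 = 16 * 50 + 42.
        System.out.println(yOffset(0f, 842f, 50f));
    }
}
```

With the 65 x 50 tile used further down and a page 842 units high, the shift is 42 units; any partially visible tile then ends up at the bottom of the page rather than at the top.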
You haven't supplied getWatermarkLines, getWatermarkRect, and getWatermarkTransform. If I use
static AffineTransform getWatermarkTransform(List<String> watermarkLines, int x) {
return AffineTransform.getTranslateInstance(6 + 15*x, 6);
}
static Rectangle getWatermarkRect() {
return new Rectangle(65, 50);
}
static List<String> getWatermarkLines() {
return Arrays.asList("Test line 1", "Test line 2");
}
your original code for me creates a top left corner like this
and the code with the above offset creates one like this
I am shrinking a PDF using the code below. Before shrinking, the PDF pages are displayed in portrait orientation, but after shrinking their orientation changes to landscape. When I print the rotation of the page before shrinking, it is reported as 270 degrees. What is causing the page to rotate after shrinking? (The PDF which I am trying to shrink contains old scanned images.)
public void shrinkPDF(String strFilePath , String strFileName) throws Exception {
PdfReader reader = new PdfReader(strFilePath+"//"+strFileName);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(strFilePath+"//Shrink_"+strFileName));
int n = reader.getNumberOfPages();
for (int p = 1; p <= 1; p++) {
float offsetX = (reader.getPageSize(p).getWidth() * (1 - xPercentage)) / 2;
float offsetY = (reader.getPageSize(p).getHeight() * (1 - yPercentage)) / 2;
PdfDictionary page;
PdfArray crop;
PdfArray media;
page = reader.getPageN(p);
System.out.println("reader.getPateRoatation-->"+reader.getPageRotation(p));
media = page.getAsArray(PdfName.CROPBOX);
if (media == null) {
media = page.getAsArray(PdfName.MEDIABOX);
}
crop = new PdfArray();
crop.add(new PdfNumber(0));
crop.add(new PdfNumber(0));
crop.add(new PdfNumber(media.getAsNumber(2).floatValue()));
crop.add(new PdfNumber(media.getAsNumber(3).floatValue()));
page.put(PdfName.MEDIABOX, crop);
page.put(PdfName.CROPBOX, crop);
Rectangle mediabox = reader.getPageSize(p);
stamper.getUnderContent(p).setLiteral(
String.format("\nq %s %s %s %s %s %s cm\nq\n",
xPercentage, mediabox.getLeft(),mediabox.getBottom(), yPercentage, offsetX, offsetY));
stamper.getOverContent(p).setLiteral("\nQ\nQ\n");
}
stamper.close();
reader.close();
}
The cause
The cause for the issue is a feature of iText:
iText tries to simplify adding information to a rotated page by starting both the overcontent and the undercontent with a rotation of the current transformation matrix. This makes additions to the page appear upright in a PDF viewer without the need to add individual rotations.
Even though the undercontent is drawn before the original page content, this normally has no effect on that original content because the whole undercontent is enveloped in a save-graphics-state / restore-graphics-state instruction pair.
The literal you use as undercontent, though, contains two save-graphics-state instructions and no restore-graphics-state instruction. This makes the added rotation suddenly affect the original content, too. Thus, your original content is rotated even though you only want to scale.
The fix
iText allows you to switch off the feature described above. You can do so by setting the PdfStamper property RotateContents to false right after creating the PdfStamper:
PdfStamper stamper = new PdfStamper(reader, result);
stamper.setRotateContents(false);
int n = reader.getNumberOfPages();
Now iText won't add that rotation anymore to the undercontent and your original only is scaled.
The PdfStamper property RotateContents has been discussed more deeply in this answer.
Annotation considerations
iText does not only add the rotation to undercontent and overcontent of page content streams, it also manipulates the dimensions of annotations added to rotated pages, and unfortunately the PdfStamper property RotateContents is not taken into account for that.
A work-around in that case is to temporarily remove the page Rotation entry before adding the annotation to the page and to put it back again later. This has already been discussed in more detail in this answer, this answer, and this answer.
Your remaining code
Your changes to crop box and media box seem unnecessary and might have unexpected and undesired results.
You add the shrinking like this:
stamper.getUnderContent(p).setLiteral(
String.format("\nq %s %s %s %s %s %s cm\nq\n",
xPercentage, mediabox.getLeft(),mediabox.getBottom(), yPercentage, offsetX, offsetY));
Setting the second and third parameter to mediabox.getLeft() and mediabox.getBottom() respectively often will have no bad effect (as these values often are 0) but in some cases you'll experience extremely distorted views of (enlarged parts of) your page.
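The six operands of the cm instruction are the matrix entries a b c d e f. For a pure scaling plus translation, the second and third entries must be 0. A minimal sketch of building such a literal, using only the standard library, could look like this; the class and method names are made up, and it assumes the lower left corner of the page is at the origin, so that the offsets center the shrunken content.

```java
import java.util.Locale;

class ScaleLiteral {
    /**
     * Builds the undercontent literal for scaling a page by sx and sy and
     * centering the result. The cm operands are "a b c d e f"; b and c stay 0
     * because no rotation or skew is wanted.
     * Assumption: the page's lower left corner is at (0, 0).
     */
    static String build(float sx, float sy, float pageWidth, float pageHeight) {
        float offsetX = pageWidth * (1 - sx) / 2;
        float offsetY = pageHeight * (1 - sy) / 2;
        return String.format(Locale.ROOT, "\nq %s 0 0 %s %s %s cm\nq\n",
                sx, sy, offsetX, offsetY);
    }

    public static void main(String[] args) {
        System.out.print(build(0.5f, 0.5f, 600f, 800f));
    }
}
```

The returned string has the same shape as the literal in the question (two q instructions, to be balanced by the "\nQ\nQ\n" in the overcontent), only with fixed zeros in place of the media box values.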
I'm working with PDF Clown to analyze and work with PDF documents. My aim is to highlight all numbers within a table. For all numbers which belong together (for example, all numbers in one column of a table) I create one TextMarkup with a list of Quads. At first glance everything seems to work well: all highlights on the left belong to one TextMarkup and all highlights on the right belong to another TextMarkup.
But when analyzing the size of the TextMarkup, its box is bigger than it looks in the picture. For example, when drawing a rectangle around the left TextMarkup box, the rectangle intersects the other column even though no highlight of the left TextMarkup does. Is there a way to optimize the box of the TextMarkup? I think there is a bulbous ending of the box, so that the box intersects the other TextMarkup.
This is the code which creates the TextMarkup:
List<Quad> highlightQuads = new ArrayList<Quad>();
for (TextMarkup textMarkup : textMarkupsForOneAnnotation) {
Rectangle2D textBox = textMarkup.getBox();
Rectangle2D.Double rectangle = new Rectangle2D.Double(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
highlightQuads.add(Quad.get(rectangle));
}
if (highlightQuads.size() > 0) {
TextMarkup _textMarkup = new TextMarkup(pagesOfNewFile.get(lastFoundNewFilePage).getPage(), highlightQuads,"", MarkupTypeEnum.Highlight);
_textMarkup.setColor(DeviceRGBColor.get(Color.GREEN));
_textMarkup.setVisible(true);
allTextMarkUps.add(_textMarkup);
}
Here is an example file Example
Thank You !!
Your code is not really self-contained (I cannot run it, in particular because the input data is missing), so I could only do a bit of PDF Clown code analysis. That code analysis, though, did indeed turn up a PDF Clown implementation detail that would explain your observation.
How does PDF Clown calculate the dimensions of the markup annotation?
The markup annotation rectangle must be big enough to include all quads plus start and end decorations (rounded left and right caps on markup rectangle).
PDF Clown calculates this rectangle as follows in TextMarkup:
public void setMarkupBoxes(
List<Quad> value
)
{
PdfArray quadPointsObject = new PdfArray();
double pageHeight = getPage().getBox().getHeight();
Rectangle2D box = null;
for(Quad markupBox : value)
{
/*
NOTE: Despite the spec prescription, Point 3 and Point 4 MUST be inverted.
*/
Point2D[] markupBoxPoints = markupBox.getPoints();
quadPointsObject.add(PdfReal.get(markupBoxPoints[0].getX())); // x1.
quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[0].getY())); // y1.
quadPointsObject.add(PdfReal.get(markupBoxPoints[1].getX())); // x2.
quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[1].getY())); // y2.
quadPointsObject.add(PdfReal.get(markupBoxPoints[3].getX())); // x4.
quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[3].getY())); // y4.
quadPointsObject.add(PdfReal.get(markupBoxPoints[2].getX())); // x3.
quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[2].getY())); // y3.
if(box == null)
{box = markupBox.getBounds2D();}
else
{box.add(markupBox.getBounds2D());}
}
getBaseDataObject().put(PdfName.QuadPoints, quadPointsObject);
/*
NOTE: Box width is expanded to make room for end decorations (e.g. rounded highlight caps).
*/
double markupBoxMargin = getMarkupBoxMargin(box.getHeight());
box.setRect(box.getX() - markupBoxMargin, box.getY(), box.getWidth() + markupBoxMargin * 2, box.getHeight());
setBox(box);
refreshAppearance();
}
private static double getMarkupBoxMargin(
double boxHeight
)
{return boxHeight * .25;}
So it takes the bounding box of all the quads and adds left and right margins each as wide as a quarter of the height of this whole bounding box.
What is the result in your case?
While this added margin width is sensible if there is only a single quad, in case of your markup annotation which includes many quads on top of one another, this results in a giant, unnecessary margin.
How to improve the code?
As the added caps depend on the individual quads and not on their combined bounding box, one can improve the code by using the maximum height of the individual quads instead of the height of the bounding box of all quads, e.g. like this:
Rectangle2D box = null;
double maxQuadHeight = 0;
for(Quad markupBox : value)
{
double quadHeight = markupBox.getBounds2D().getHeight();
if (quadHeight > maxQuadHeight)
maxQuadHeight = quadHeight;
...
}
...
double markupBoxMargin = getMarkupBoxMargin(maxQuadHeight);
box.setRect(box.getX() - markupBoxMargin, box.getY(), box.getWidth() + markupBoxMargin * 2, box.getHeight());
setBox(box);
If you don't want to patch PDF Clown for this, you can also execute this code (with minor adaptations) after constructing the TextMarkup _textMarkup to correct the precalculated annotation rectangle.
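To make the effect of this change concrete without PDF Clown at hand, the calculation can be reproduced with plain java.awt.geom rectangles standing in for the bounds of the quads. This is only a sketch; the class and method names are made up, and the .25 factor is the one from getMarkupBoxMargin above.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

class MarkupBox {
    /**
     * Union of all quad bounds, widened left and right by a quarter of the
     * tallest single quad (instead of a quarter of the union's height).
     * Returns null for an empty list.
     */
    static Rectangle2D annotationBox(List<? extends Rectangle2D> quadBounds) {
        Rectangle2D box = null;
        double maxQuadHeight = 0;
        for (Rectangle2D quad : quadBounds) {
            maxQuadHeight = Math.max(maxQuadHeight, quad.getHeight());
            if (box == null) {
                box = (Rectangle2D) quad.clone();
            } else {
                box.add(quad);
            }
        }
        if (box == null) {
            return null;
        }
        double margin = maxQuadHeight * .25;
        box.setRect(box.getX() - margin, box.getY(),
                box.getWidth() + margin * 2, box.getHeight());
        return box;
    }

    public static void main(String[] args) {
        // Two 10 unit high highlights in the same column, 100 units apart.
        System.out.println(annotationBox(Arrays.asList(
                new Rectangle2D.Double(100, 100, 50, 10),
                new Rectangle2D.Double(100, 200, 50, 10))));
    }
}
```

For the two quads in main, the union is 110 units high. The original calculation would therefore add a margin of 27.5 units on each side, while the version based on the tallest quad adds only 2.5 units.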
Is this fixing a PDF Clown error?
It is not an error as there is no need for the text markup annotation rectangle to be minimal; PDF Clown could also always use the whole crop box for each such annotation.
I would assume, though, that the author of the code wanted to calculate a somewhat minimal rectangle but only optimized for single line and so in a way did not live up to his own expectations...
Are there other problems in this code?
Yes. The text a markup annotation marks need not be horizontal; it may be at an angle, and it could even be vertical. In such a case some margin would also be needed at the top and the bottom of the annotation rectangle, not (only) at the left and the right.