PDF Shrink causing change in Orientation


I am shrinking a PDF using the code below. Before shrinking, the pages are displayed in portrait orientation, but after shrinking they appear in landscape. When I print the page rotation before shrinking, it is reported as 270 degrees. What causes the pages to rotate after shrinking? (The PDF I am trying to shrink contains old scanned images.)
public void shrinkPDF(String strFilePath , String strFileName) throws Exception {
PdfReader reader = new PdfReader(strFilePath+"//"+strFileName);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(strFilePath+"//Shrink_"+strFileName));
int n = reader.getNumberOfPages();
for (int p = 1; p <= 1; p++) {
float offsetX = (reader.getPageSize(p).getWidth() * (1 - xPercentage)) / 2;
float offsetY = (reader.getPageSize(p).getHeight() * (1 - yPercentage)) / 2;
PdfDictionary page;
PdfArray crop;
PdfArray media;
page = reader.getPageN(p);
System.out.println("reader.getPateRoatation-->"+reader.getPageRotation(p));
media = page.getAsArray(PdfName.CROPBOX);
if (media == null) {
media = page.getAsArray(PdfName.MEDIABOX);
}
crop = new PdfArray();
crop.add(new PdfNumber(0));
crop.add(new PdfNumber(0));
crop.add(new PdfNumber(media.getAsNumber(2).floatValue()));
crop.add(new PdfNumber(media.getAsNumber(3).floatValue()));
page.put(PdfName.MEDIABOX, crop);
page.put(PdfName.CROPBOX, crop);
Rectangle mediabox = reader.getPageSize(p);
stamper.getUnderContent(p).setLiteral(
String.format("\nq %s %s %s %s %s %s cm\nq\n",
xPercentage, mediabox.getLeft(),mediabox.getBottom(), yPercentage, offsetX, offsetY));
stamper.getOverContent(p).setLiteral("\nQ\nQ\n");
}
stamper.close();
reader.close();
}

The cause
The cause for the issue is a feature of iText:
iText tries to simplify adding information to a rotated page by starting both the overcontent and the undercontent with a rotation of the current transformation matrix. This makes additions to the page appear upright in a PDF viewer without the need to add individual rotations.
Even though the undercontent is drawn before the original page content, this normally has no effect on that original content because the whole undercontent is enveloped in a save-graphics-state / restore-graphics-state instruction pair.
The literal you use as undercontent, though, contains two save-graphics-state instructions and no restore-graphics-state instruction. This makes the added rotation suddenly affect the original content, too. Thus, your original content is rotated even though you only want to scale.
The fix
iText allows you to switch off the feature described above. You can do so by setting the PdfStamper property RotateContents to false right after creating the PdfStamper:
PdfStamper stamper = new PdfStamper(reader, result);
stamper.setRotateContents(false);
int n = reader.getNumberOfPages();
Now iText won't add that rotation to the undercontent anymore, and your original content is only scaled.
The PdfStamper property RotateContents has been discussed more deeply in this answer.
Annotation considerations
iText does not only add the rotation to undercontent and overcontent of page content streams, it also manipulates the dimensions of annotations added to rotated pages, and unfortunately the PdfStamper property RotateContents is not taken into account for that.
A work-around in that case is to temporarily remove the page's Rotate entry before adding the annotation to the page and to put it back again later. This has already been discussed in more detail in this answer, this answer, and this answer.
Your remaining code
Your changes to crop box and media box seem unnecessary and might have unexpected and undesired results.
You add the shrinking like this:
stamper.getUnderContent(p).setLiteral(
String.format("\nq %s %s %s %s %s %s cm\nq\n",
xPercentage, mediabox.getLeft(),mediabox.getBottom(), yPercentage, offsetX, offsetY));
Setting the second and third parameters to mediabox.getLeft() and mediabox.getBottom() respectively often has no ill effect (as these values are often 0), but they are the b and c components of the transformation matrix, which must be 0 for pure scaling. When the media box does not start at the origin, you'll get extremely distorted views of (enlarged parts of) your page.
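As a minimal sketch of the point above, the literal can be built with the second and third operands fixed at 0. The class and method names (ShrinkMatrix, scaleLiteral) are hypothetical and not part of iText; the sketch only produces the string that would be passed to setLiteral and assumes the page's lower-left corner is at the origin.

```java
import java.util.Locale;

final class ShrinkMatrix {

    /**
     * Builds the undercontent literal for scaling a page by (sx, sy) and
     * centring the scaled content on a page of the given size.
     * The "cm" operands are a b c d e f; b and c stay 0 for pure scaling.
     */
    static String scaleLiteral(float pageWidth, float pageHeight, float sx, float sy) {
        float offsetX = pageWidth * (1 - sx) / 2;
        float offsetY = pageHeight * (1 - sy) / 2;
        // Locale.ROOT keeps the output independent of the default locale.
        return String.format(Locale.ROOT, "\nq %s 0 0 %s %s %s cm\nq\n",
                sx, sy, offsetX, offsetY);
    }

    public static void main(String[] args) {
        // Hypothetical 600 x 800 page shrunk to half size in both directions.
        System.out.print(scaleLiteral(600f, 800f, 0.5f, 0.5f));
    }
}
```

In the question's loop the result would replace the String.format(...) argument, with the overcontent still closing both graphics states via "\nQ\nQ\n".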

Related

Getting DPI of PDPage/PDDocument to calculate PDF Dimensions Accurately

I'm looking to get an accurate size of each page in a PDF as part of a Unit test of PDF's I'll be creating. As I'm dealing with PDFs that have many different page sizes in each document the code returns an ArrayList of dimensions.
AFAIK each page can have its own DPI setting too.
I've done quite a bit of Googling but I've only come up with this which only gives me part of the answer, as I still need to work out what DPI each page is.
PDFBox - find page dimensions
public static ArrayList<float[]> getDimentions(PDDocument document) {
ArrayList<float[]> dimensions = new ArrayList<>();
float[] dim = new float[2];
//Loop Round Each Page
//Get Dimensions of each page and DPI
for (int i = 0; i < document.getNumberOfPages(); i++) {
PDPage currentPage = document.getPage(i);
PDRectangle mediaBox = currentPage.getMediaBox();
float height = mediaBox.getHeight();
float width = mediaBox.getWidth();
// How do I get the DPI now????
}
//Calculate Size of Page in mm (
//Get Dimensions of each page and DPI ( https://stackoverflow.com/questions/20904191/pdfbox-find-page-dimensions/20905166 )
//Add size of page to list of page sizes
//Return list of page sizes.
return dimensions;
}

The page dimensions (media box / crop box) are given in default user space units, which in turn default to 1/72 inch. So simply divide the box width or height by 72 to get the page width or height in inches.
This does not correspond to a DPI value of 72, though: that would be a resolution value, and a PDF does not have a resolution. The default user space unit is merely a different unit of measurement.
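The conversion described above can be sketched as plain arithmetic. The class and method names are hypothetical; the userUnit parameter is an assumption standing for the optional UserUnit entry of the page dictionary (a multiplier of 1/72 inch that defaults to 1), so pass 1 when the page has no such entry.

```java
final class PageUnits {

    private static final double UNITS_PER_INCH = 72.0;
    private static final double MM_PER_INCH = 25.4;

    /** Converts a length in default user space units to inches. */
    static double toInches(double userSpaceUnits, double userUnit) {
        return userSpaceUnits * userUnit / UNITS_PER_INCH;
    }

    /** Converts a length in default user space units to millimetres. */
    static double toMillimetres(double userSpaceUnits, double userUnit) {
        return toInches(userSpaceUnits, userUnit) * MM_PER_INCH;
    }

    public static void main(String[] args) {
        // Media box width and height of a typical A4 page, no UserUnit entry.
        System.out.printf("%.1f x %.1f mm%n",
                toMillimetres(595.276, 1), toMillimetres(841.89, 1));
    }
}
```

In the question's loop, width and height from mediaBox.getWidth() and mediaBox.getHeight() would be passed through toMillimetres and stored in a fresh float[] per page.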

PDFClown: Creating a TextMarkup leads to an inaccurate Box of the TextMarkup

I'm working with PDFClown to analyze and work with PDF documents. My aim is to highlight all numbers within a table. For all numbers which belong together (for example, all numbers in one column of a table) I create one TextMarkup with a list of Quads. At first glance everything seems to work well: all highlights on the left belong to one TextMarkup and all highlights on the right belong to another TextMarkup.
But when analyzing the size of the TextMarkup, the size is bigger than it looks in the picture. So when drawing, for example, a rectangle around the left TextMarkup box, the rectangle intersects the other column even though no highlight of the left TextMarkup intersects the other column. Is there a way to optimize the box of the TextMarkup? I think there is a bulbous ending of the box, so that the box intersects the other TextMarkup.
This is the code which creates the TextMarkup:
List<Quad> highlightQuads = new ArrayList<Quad>();
for (TextMarkup textMarkup : textMarkupsForOneAnnotation) {
Rectangle2D textBox = textMarkup.getBox();
Rectangle2D.Double rectangle = new Rectangle2D.Double(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
highlightQuads.add(Quad.get(rectangle));
}
if (highlightQuads.size() > 0) {
TextMarkup _textMarkup = new TextMarkup(pagesOfNewFile.get(lastFoundNewFilePage).getPage(), highlightQuads,"", MarkupTypeEnum.Highlight);
_textMarkup.setColor(DeviceRGBColor.get(Color.GREEN));
_textMarkup.setVisible(true);
allTextMarkUps.add(_textMarkup);
}
Here is an example file Example
Thank You !!

Your code is not really self-contained (I cannot run it, in particular because the input data is missing), so I could only do a bit of PDF Clown code analysis. That code analysis, though, did indeed turn up a PDF Clown implementation detail that would explain your observation.
How does PDF Clown calculate the dimensions of the markup annotation?
The markup annotation rectangle must be big enough to include all quads plus start and end decorations (rounded left and right caps on markup rectangle).
PDF Clown calculates this rectangle as follows in TextMarkup:
public void setMarkupBoxes(
List<Quad> value
)
{
PdfArray quadPointsObject = new PdfArray();
double pageHeight = getPage().getBox().getHeight();
Rectangle2D box = null;
for(Quad markupBox : value)
{
/*
NOTE: Despite the spec prescription, Point 3 and Point 4 MUST be inverted.
*/
Point2D[] markupBoxPoints = markupBox.getPoints();
quadPointsObject.add(PdfReal.get(markupBoxPoints[0].getX())); // x1.
quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[0].getY())); // y1.
quadPointsObject.add(PdfReal.get(markupBoxPoints[1].getX())); // x2.
quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[1].getY())); // y2.
quadPointsObject.add(PdfReal.get(markupBoxPoints[3].getX())); // x4.
quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[3].getY())); // y4.
quadPointsObject.add(PdfReal.get(markupBoxPoints[2].getX())); // x3.
quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[2].getY())); // y3.
if(box == null)
{box = markupBox.getBounds2D();}
else
{box.add(markupBox.getBounds2D());}
}
getBaseDataObject().put(PdfName.QuadPoints, quadPointsObject);
/*
NOTE: Box width is expanded to make room for end decorations (e.g. rounded highlight caps).
*/
double markupBoxMargin = getMarkupBoxMargin(box.getHeight());
box.setRect(box.getX() - markupBoxMargin, box.getY(), box.getWidth() + markupBoxMargin * 2, box.getHeight());
setBox(box);
refreshAppearance();
}
private static double getMarkupBoxMargin(
double boxHeight
)
{return boxHeight * .25;}
So it takes the bounding box of all the quads and adds left and right margins each as wide as a quarter of the height of this whole bounding box.
What is the result in your case?
While this added margin width is sensible if there is only a single quad, in case of your markup annotation which includes many quads on top of one another, this results in a giant, unnecessary margin.
How to improve the code?
As the added caps depend on the individual quads and not their combined bounding box, one can improve the code by using the maximum height of the individual quads instead of the height of the bounding box of all quads, e.g. like this:
Rectangle2D box = null;
double maxQuadHeight = 0;
for(Quad markupBox : value)
{
double quadHeight = markupBox.getBounds2D().getHeight();
if (quadHeight > maxQuadHeight)
maxQuadHeight = quadHeight;
...
}
...
double markupBoxMargin = getMarkupBoxMargin(maxQuadHeight);
box.setRect(box.getX() - markupBoxMargin, box.getY(), box.getWidth() + markupBoxMargin * 2, box.getHeight());
setBox(box);
If you don't want to patch PDF Clown for this, you can also execute this code (with minor adaptations) after constructing the TextMarkup _textMarkup to correct the precalculated annotation rectangle.
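A minimal sketch of that external correction, using only java.awt.geom so that it does not depend on PDF Clown itself: the class and method names are hypothetical, and the input is assumed to be the bounds of the individual quads (for example the rectangles the question already builds from textMarkup.getBox()).

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

final class MarkupBoxSketch {

    /**
     * Returns the union of all quad bounds, widened on the left and right by
     * a quarter of the tallest single quad (not of the whole union).
     * Returns null for an empty list.
     */
    static Rectangle2D tightBox(List<? extends Rectangle2D> quadBounds) {
        Rectangle2D box = null;
        double maxQuadHeight = 0;
        for (Rectangle2D quad : quadBounds) {
            maxQuadHeight = Math.max(maxQuadHeight, quad.getHeight());
            if (box == null) {
                box = new Rectangle2D.Double(quad.getX(), quad.getY(),
                        quad.getWidth(), quad.getHeight());
            } else {
                box.add(quad);
            }
        }
        if (box == null) {
            return null;
        }
        double margin = maxQuadHeight * .25;
        box.setRect(box.getX() - margin, box.getY(),
                box.getWidth() + margin * 2, box.getHeight());
        return box;
    }

    public static void main(String[] args) {
        // Two hypothetical highlights stacked in one table column.
        List<Rectangle2D> quads = Arrays.asList(
                new Rectangle2D.Double(10, 10, 40, 8),
                new Rectangle2D.Double(10, 30, 40, 8));
        System.out.println(tightBox(quads));
    }
}
```

For the two sample quads the margin is 2 on each side, whereas a quarter of the combined height of 28 would give 7. The resulting rectangle could then be handed to _textMarkup.setBox(...), assuming that setter is accessible in the PDF Clown version in use.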
Is this fixing a PDF Clown error?
It is not an error as there is no need for the text markup annotation rectangle to be minimal; PDF Clown could also always use the whole crop box for each such annotation.
I would assume, though, that the author of the code wanted to calculate a somewhat minimal rectangle but only optimized for single line and so in a way did not live up to his own expectations...
Are there other problems in this code?
Yes. The text a markup annotation marks need not be horizontal; it may be at an angle, or even vertical. In such a case some margin would also be needed at the top and the bottom of the annotation rectangle, not (only) at the left and the right.

Adding footer to existing PDF

I am trying to add footer to my existing PDF. I did add one footer to the PDF.
Is there anyway to add 2 lines of footer? This is my code below:
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileOutputStream(new File("D:/TestDestination/Merge Output1.pdf")));
document.open();
PdfReader reader1 = new PdfReader("D:/TestDestination/Merge Output.pdf");
int n1 = reader1.getNumberOfPages();
PdfImportedPage page;
PdfCopy.PageStamp stamp;
Font ffont = new Font(Font.FontFamily.UNDEFINED, 5, Font.ITALIC);
for (int i = 0; i < n1; ) {
page = copy.getImportedPage(reader1, ++i);
stamp = copy.createPageStamp(page);
ColumnText.showTextAligned(stamp.getUnderContent(), Element.ALIGN_CENTER,new Phrase(String.format("page %d of %d", i, n1)),297.5f, 28, 0);
stamp.alterContents();
copy.addPage(page);
}
document.close();
reader1.close();

Please go to the official documentation and click Q&A to go to the Frequently Asked Questions. Select Absolute positioning of text.
You are currently using ColumnText in a way that allows you to add a single line of text. You are using ColumnText.showTextAligned(...) as explained in my answer to the question How to rotate a single line of text?
You should read the answers to questions such as:
How to add text at an absolute position on the top of the first page?
How to add text inside a rectangle?
How to truncate text within a bounding box?
How to fit a String inside a rectangle?
How to reduce redundant code when adding content at absolute positions?
Assuming that you don't have access to the official web site (otherwise you wouldn't have posted your question), I'm adding a short code snippet:
ColumnText ct = new ColumnText(stamp.getUnderContent());
ct.setSimpleColumn(rectangle);
ct.addElement(new Paragraph("Whatever text needs to fit inside the rectangle"));
ct.go();
In this snippet, stamp is the object you created in your code. The rectangle object is of type Rectangle. Its parameters are the coordinates of the lower-left and upper-right corner of the rectangle in which you want to render the multi-line text.
Caveat: all text that doesn't fit the rectangle will be dropped. You can avoid this by adding the text in simulation mode first. If the text fits, add it for real. If it doesn't fit, try anew using a smaller font or a bigger rectangle.

Java: Apache PDFbox Extract highlighted text

I am using the Apache PDFBox library to extract the highlighted text (i.e., with yellow background) from a PDF file. I am totally new to this library and don't know which of its classes to use for this purpose.
So far I have done extraction of text from comments using below code.
PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
int pageNum = i + 1;
PDPage page = (PDPage) allPages.get(i);
List<PDAnnotation> la = page.getAnnotations();
if (la.size() < 1) {
continue;
}
System.out.println("Total annotations = " + la.size());
System.out.println("\nProcess Page " + pageNum + "...");
// Just get the first annotation for testing
PDAnnotation pdfAnnot = la.get(0);
System.out.println("Getting text from comment = " + pdfAnnot.getContents());
Now I need to get the highlighted text, any code example will be highly appreciated.

I hope this answer helps everyone who is facing the same problem.
// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
ArrayList<String> highlightedTexts = new ArrayList<>();
// this is the in-memory representation of the PDF document.
// this will load a document from a file.
PDDocument document = PDDocument.load(filePath);
// this represents all pages in a PDF document.
List<PDPage> allPages = document.getDocumentCatalog().getAllPages();
// this represents a single page in a PDF document.
PDPage page = allPages.get(pageNumber);
// get annotation dictionaries
List<PDAnnotation> annotations = page.getAnnotations();
for(int i=0; i<annotations.size(); i++) {
// check subType
if(annotations.get(i).getSubtype().equals("Highlight")) {
// extract highlighted text
PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
String str = null;
for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {
COSFloat ULX = (COSFloat) quadsArray.get(0+k);
COSFloat ULY = (COSFloat) quadsArray.get(1+k);
COSFloat URX = (COSFloat) quadsArray.get(2+k);
COSFloat URY = (COSFloat) quadsArray.get(3+k);
COSFloat LLX = (COSFloat) quadsArray.get(4+k);
COSFloat LLY = (COSFloat) quadsArray.get(5+k);
COSFloat LRX = (COSFloat) quadsArray.get(6+k);
COSFloat LRY = (COSFloat) quadsArray.get(7+k);
k+=8;
float ulx = ULX.floatValue() - 1; // upper left x.
float uly = ULY.floatValue(); // upper left y.
float width = URX.floatValue() - LLX.floatValue(); // calculated by upperRightX - lowerLeftX.
float height = URY.floatValue() - LLY.floatValue(); // calculated by upperRightY - lowerLeftY.
PDRectangle pageSize = page.getMediaBox();
uly = pageSize.getHeight() - uly;
Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
stripperByArea.addRegion("highlightedRegion", rectangle_2);
stripperByArea.extractRegions(page);
String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");
if(j > 1) {
str = str.concat(highlightedText);
} else {
str = highlightedText;
}
}
highlightedTexts.add(str);
}
}
document.close();
return highlightedTexts;
}

The code in the question Not able to read the exact text highlighted across the lines already illustrates most concepts to use for extracting text from limited content regions on a page with PDFBox.
Having studied this code, the OP still wondered in a comment:
But one thing I am confused about is QuadPoints instead of Rect. as you mentioned there in comment. What are this, can you explain it with some code lines or in simple words, as I am also facing the same problem of multi lines highlghts?
In general the area an annotation refers to is a rectangle:
Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.
(from Table 164 – Entries common to all annotation dictionaries - in ISO 32000-1)
For some annotations types (e.g. text markups), this location value does not suffice because:
text to markup may be written at some odd angle but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
text to markup may start anywhere in a line and end anywhere in another one, so the markup area is not rectangular at all but it is the union of multiple rectangular parts.
To cope with such annotation types, therefore, the PDF specification provides a more generic way to define areas:
QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompasses a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order
x1 y1 x2 y2 x3 y3 x4 y4
specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x1, y1) and (x2, y2).
(from Table 179 – Additional entries specific to text markup annotations - in ISO 32000-1)
Thus, instead of the rectangle given by
PDRectangle rect = pdfAnnot.getRectangle();
in the code in the referenced question, you have to consider the quadrilaterals given by
COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
and define regions for the PDFTextStripperByArea stripper accordingly. Unfortunately PDFTextStripperByArea.addRegion expects a rectangle as parameter, not some generic quadrilateral. As text usually is printed horizontally or vertically, that should not pose too big a problem.
PS One warning concerning the specification of the QuadPoints, the order may differ in real-life PDFs, cf. the question PDF Spec vs Acrobat creation (QuadPoints).
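One way to turn a quadrilateral into a region rectangle, sketched here without any PDFBox dependency, is to take the bounding box of its four points and flip the y axis. The class and method names are hypothetical. The sketch assumes the quad values have already been read from the COSArray into a float[], that the page is not rotated, and that the media box starts at the origin. Because it uses minima and maxima, it does not depend on the order in which a PDF producer wrote the four points.

```java
import java.awt.geom.Rectangle2D;

final class QuadRegions {

    /**
     * Computes the bounding rectangle of the quadrilateral stored at
     * quadPoints[offset .. offset + 7] (x1 y1 x2 y2 x3 y3 x4 y4) and converts
     * it from PDF coordinates (origin bottom-left) to a region with a
     * top-left origin.
     */
    static Rectangle2D.Float toRegion(float[] quadPoints, int offset, float pageHeight) {
        float minX = Float.POSITIVE_INFINITY;
        float minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float maxY = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < 4; i++) {
            float x = quadPoints[offset + 2 * i];
            float y = quadPoints[offset + 2 * i + 1];
            minX = Math.min(minX, x);
            maxX = Math.max(maxX, x);
            minY = Math.min(minY, y);
            maxY = Math.max(maxY, y);
        }
        // The top edge in PDF space becomes the y origin of the region.
        return new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY);
    }

    public static void main(String[] args) {
        // Hypothetical single-line highlight on a 792 unit high page.
        float[] quad = {100f, 700f, 200f, 700f, 100f, 688f, 200f, 688f};
        System.out.println(toRegion(quad, 0, 792f));
    }
}
```

Each resulting rectangle would be registered under its own region name with PDFTextStripperByArea.addRegion, one per group of eight numbers, so that multi-line highlights yield one region per line.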

How to add text to an image?

In my project I use iText to generate a PDF document.
Suppose that the height of a page measures 500pt (1 user unit = 1 point), and that I write some text to the page, followed by an image.
If the content and the image require less than 450pt, the text precedes the image.
If the content and the image exceed 450pt, the text is forwarded to the next page.
My question is: how can I obtain the remaining available space before writing an image?

First things first: when adding text and images to a page, iText sometimes changes the order of the textual content and the image. You can avoid this by using:
writer.setStrictImageSequence(true);
If you want to know the current position of the "cursor", you can use the method getVerticalPosition(). Unfortunately, this method isn't very elegant: it requires a Boolean parameter that will add a newline (if true) or give you the position at the current line (if false).
I do not understand why you want to get the vertical position. Is it because you want to have a caption followed by an image, and you want the caption and the image to be on the same page?
In that case, you could put your text and images inside a table cell and instruct iText not to split rows. In this case, iText will forward both text and image, in the correct order to the next page if the content doesn't fit the current page.
Update:
Based on the extra information added in the comments, it is now clear that the OP wants to add images that are watermarked.
There are two approaches to achieve this, depending on the actual requirement.
Approach 1:
The first approach is explained in the WatermarkedImages1 example. In this example, we create a PdfTemplate to which we add an image as well as some text written on top of that image. We can then wrap this PdfTemplate inside an image and add that image together with its watermark using a single document.add() statement.
This is the method that performs all the magic:
public Image getWatermarkedImage(PdfContentByte cb, Image img, String watermark) throws DocumentException {
float width = img.getScaledWidth();
float height = img.getScaledHeight();
PdfTemplate template = cb.createTemplate(width, height);
template.addImage(img, width, 0, 0, height, 0, 0);
ColumnText.showTextAligned(template, Element.ALIGN_CENTER,
new Phrase(watermark, FONT), width / 2, height / 2, 30);
return Image.getInstance(template);
}
This is how we add the images:
PdfContentByte cb = writer.getDirectContentUnder();
document.add(getWatermarkedImage(cb, Image.getInstance(IMAGE1), "Bruno"));
document.add(getWatermarkedImage(cb, Image.getInstance(IMAGE2), "Dog"));
document.add(getWatermarkedImage(cb, Image.getInstance(IMAGE3), "Fox"));
Image img = Image.getInstance(IMAGE4);
img.scaleToFit(400, 700);
document.add(getWatermarkedImage(cb, img, "Bruno and Ingeborg"));
As you can see, we have one very large image (a picture of my wife and me). We need to scale this image so that it fits the page. If you want to avoid this, take a look at the second approach.
Approach 2:
The second approach is explained in the WatermarkedImages2 example. In this case, we add each image to a PdfPCell. This PdfPCell will scale the image so that it fits the width of the page. To add the watermark, we use a cell event:
class WatermarkedCell implements PdfPCellEvent {
String watermark;
public WatermarkedCell(String watermark) {
this.watermark = watermark;
}
public void cellLayout(PdfPCell cell, Rectangle position,
PdfContentByte[] canvases) {
PdfContentByte canvas = canvases[PdfPTable.TEXTCANVAS];
ColumnText.showTextAligned(canvas, Element.ALIGN_CENTER,
new Phrase(watermark, FONT),
(position.getLeft() + position.getRight()) / 2,
(position.getBottom() + position.getTop()) / 2, 30);
}
}
This cell event can be used like this:
PdfPCell cell;
cell = new PdfPCell(Image.getInstance(IMAGE1), true);
cell.setCellEvent(new WatermarkedCell("Bruno"));
table.addCell(cell);
cell = new PdfPCell(Image.getInstance(IMAGE2), true);
cell.setCellEvent(new WatermarkedCell("Dog"));
table.addCell(cell);
cell = new PdfPCell(Image.getInstance(IMAGE3), true);
cell.setCellEvent(new WatermarkedCell("Fox"));
table.addCell(cell);
cell = new PdfPCell(Image.getInstance(IMAGE4), true);
cell.setCellEvent(new WatermarkedCell("Bruno and Ingeborg"));
table.addCell(cell);
You will use this approach if all images have more or less the same size, and if you don't want to worry about fitting the images on the page.
Consideration:
Obviously, both approaches have a different result because of the design choice that is made. Please compare the resulting PDFs to see the difference: watermark_template.pdf versus watermark_table.pdf
