Get Images by rectangle - java

I have this method which can extract text in a specific location in the pdf
public static void getTextByRectangle(PDDocument doc,Rectangle rect) throws IOException{
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
stripper.addRegion( "class1", rect );
PDPage firstPage = doc.getPage(0);
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rect );
System.out.println( stripper.getTextForRegion( "class1" ) );
}
Is it possible to do the same thing but for extracting images??

Yes, you can extract all images, and compare the position of rect and images. Here is the example by pdfbox. This can get image locations.
You need create a class extends PDFStreamEngine. Like this,
public class PrintImageLocations extends PDFStreamEngine
You should override processOperator. And from ctmNew, you can get the image location, then compare image with yours rect, you will get the right image.
#Override
protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
String operation = operator.getName();
if ("Do".equals(operation)) {
COSName objectName = (COSName) operands.get(0);
PDXObject xobject = getResources().getXObject(objectName);
if (xobject instanceof PDImageXObject) {
PDImageXObject image = (PDImageXObject) xobject;
Matrix ctmNew = getGraphicsState().getCurrentTransformationMatrix();
float imageXScale = ctmNew.getScalingFactorX();
float imageYScale = ctmNew.getScalingFactorY();
// position in user space units. 1 unit = 1/72 inch at 72 dpi
System.out.println("position in PDF = " + ctmNew.getTranslateX() + ", " + ctmNew.getTranslateY() + " in user space units");
// displayed size in user space units
System.out.println("displayed size = " + imageXScale + ", " + imageYScale + " in user space units");
} else if (xobject instanceof PDFormXObject) {
PDFormXObject form = (PDFormXObject) xobject;
showForm(form);
}
} else {
super.processOperator(operator, operands);
}
}
Thanks mkl and FiReTiTi's advice.

Related

PDCheckbox is not checked for signature

I am looping over my fields to update them, I have textfields and checkboxes:
for (FieldInput formField : formFields) {
PDField field = acroForm.getField(formField.id);
if (field != null) {
if (field.getFieldType() == "Btn" && formField.value.equals("true")) {
((PDCheckBox) field).check();
} else {
field.setValue(formField.value);
}
field.setReadOnly(true);
field.getCOSObject().setNeedToBeUpdated(true);
if (field.getFieldType().equals("Tx")) {
field.getWidgets().get(0).getAppearance().getCOSObject().setNeedToBeUpdated(true);
}
Log.info("set field: " + field.getFullyQualifiedName() + " to " + formField.value);
}
}
But in adobe the checkbox is not filled (it is in my viewer in vs-code). When I tried to update the appearance I get an error that there is no widget for this the return value of "org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget.getAppearance()" is null.
But according to my creation code there should be one:
PDField formField;
if (field.type.equals("checkbox")) {
formField = new PDCheckBox(form);
} else {
formField = new PDTextField(form);
}
formField.setPartialName(field.id);
if (!field.type.equals("checkbox")) {
String defaultAppearance = "/Helv " + (field.fontSize * height) + " Tf 0 0 0 rg";
((PDTextField) formField).setDefaultAppearance(defaultAppearance);
}
form.getFields().add(formField);
PDAnnotationWidget widget = formField.getWidgets().get(0);
Log.info(height - height * field.y);
PDRectangle rect = new PDRectangle(field.x * width, height - field.y * height - field.height * height,
field.width * width,
field.height * height);
widget.setRectangle(rect);
widget.setPage(page);
PDAppearanceCharacteristicsDictionary fieldAppearance = new PDAppearanceCharacteristicsDictionary(
new COSDictionary());
widget.setAppearanceCharacteristics(fieldAppearance);
widget.setPrinted(true);
page.getAnnotations().add(widget);

How to remove an image from an excel sheet (XSSF) in java

I have now been trying for too long to remove an image from my XSSFSheet. I cannot find any information about this, but I would think that it has to be possible..
Is there any way to remove an image from my XSSFSheet? Even the official (?) apache poi website does not mention anything besides creating and reading images
I am now not far away from giving up and just copying everything except said image into a new sheet. Which is obviously not how this should be done. I don't think I would be able to sleep well for a week if I did that.
My last unsuccessful attempt was to use my code which moves images (I shared that code in this post) but instead of setting valid row numbers I would set null, but that's not possible since the parameter for setRow() is int (primitive type).
Then I tried setting a negative value for the anchor rows. While this technically removes the images, the excel file has to be repaired when it is opened the next time. The images are not being displayed.
I believe I would have to remove the relation from the XSSFDrawing too to completely remove the image (I think this after finding this custom implementation of XSSFDrawing) but I have no idea what is going on there...
I would be grateful for any kind of help here!
For XSSF this is not as simple as it sounds. There is HSSFPatriarch.removeShape but there is not something comparable in XSSFDrawing.
We must delete the picture itself inclusive the relations. And we must delete the shape's anchor from the drawing.
Example which goes trough all pictures in a sheet and deletes a picture if it's shape name is "Image 2":
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;
import org.apache.poi.hssf.usermodel.*;
import org.apache.xmlbeans.XmlCursor;
import java.io.*;
class ExcelDeleteImage {
public static void deleteCTAnchor(XSSFPicture xssfPicture) {
XSSFDrawing drawing = xssfPicture.getDrawing();
XmlCursor cursor = xssfPicture.getCTPicture().newCursor();
cursor.toParent();
if (cursor.getObject() instanceof org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTTwoCellAnchor) {
for (int i = 0; i < drawing.getCTDrawing().getTwoCellAnchorList().size(); i++) {
if (cursor.getObject().equals(drawing.getCTDrawing().getTwoCellAnchorArray(i))) {
drawing.getCTDrawing().removeTwoCellAnchor(i);
System.out.println("TwoCellAnchor for picture " + xssfPicture + " was deleted.");
}
}
} else if (cursor.getObject() instanceof org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTOneCellAnchor) {
for (int i = 0; i < drawing.getCTDrawing().getOneCellAnchorList().size(); i++) {
if (cursor.getObject().equals(drawing.getCTDrawing().getOneCellAnchorArray(i))) {
drawing.getCTDrawing().removeOneCellAnchor(i);
System.out.println("OneCellAnchor for picture " + xssfPicture + " was deleted.");
}
}
} else if (cursor.getObject() instanceof org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTAbsoluteAnchor) {
for (int i = 0; i < drawing.getCTDrawing().getAbsoluteAnchorList().size(); i++) {
if (cursor.getObject().equals(drawing.getCTDrawing().getAbsoluteAnchorArray(i))) {
drawing.getCTDrawing().removeAbsoluteAnchor(i);
System.out.println("AbsoluteAnchor for picture " + xssfPicture + " was deleted.");
}
}
}
}
public static void deleteEmbeddedXSSFPicture(XSSFPicture xssfPicture) {
if (xssfPicture.getCTPicture().getBlipFill() != null) {
if (xssfPicture.getCTPicture().getBlipFill().getBlip() != null) {
if (xssfPicture.getCTPicture().getBlipFill().getBlip().getEmbed() != null) {
String rId = xssfPicture.getCTPicture().getBlipFill().getBlip().getEmbed();
XSSFDrawing drawing = xssfPicture.getDrawing();
drawing.getPackagePart().removeRelationship(rId);
drawing.getPackagePart().getPackage().deletePartRecursive(drawing.getRelationById(rId).getPackagePart().getPartName());
System.out.println("Picture " + xssfPicture + " was deleted.");
}
}
}
}
public static void deleteHSSFShape(HSSFShape shape) {
HSSFPatriarch drawing = shape.getPatriarch();
drawing.removeShape(shape);
System.out.println("Shape " + shape + " was deleted.");
}
public static void main(String[] args) throws Exception {
String filename = "ExcelWithImages.xlsx";
//String filename = "ExcelWithImages.xls";
InputStream inp = new FileInputStream(filename);
Workbook workbook = WorkbookFactory.create(inp);
Sheet sheet = workbook.getSheetAt(0);
Drawing drawing = sheet.getDrawingPatriarch();
XSSFPicture xssfPictureToDelete = null;
if (drawing instanceof XSSFDrawing) {
for (XSSFShape shape : ((XSSFDrawing)drawing).getShapes()) {
if (shape instanceof XSSFPicture) {
XSSFPicture xssfPicture = (XSSFPicture)shape;
String shapename = xssfPicture.getShapeName();
int row = xssfPicture.getClientAnchor().getRow1();
int col = xssfPicture.getClientAnchor().getCol1();
System.out.println("Picture " + "" + " with Shapename: " + shapename + " is located row: " + row + ", col: " + col);
if ("Image 2".equals(shapename)) xssfPictureToDelete = xssfPicture;
}
}
}
if (xssfPictureToDelete != null) deleteEmbeddedXSSFPicture(xssfPictureToDelete);
if (xssfPictureToDelete != null) deleteCTAnchor(xssfPictureToDelete);
HSSFPicture hssfPictureToDelete = null;
if (drawing instanceof HSSFPatriarch) {
for (HSSFShape shape : ((HSSFPatriarch)drawing).getChildren()) {
if (shape instanceof HSSFPicture) {
HSSFPicture hssfPicture = (HSSFPicture)shape;
int picIndex = hssfPicture.getPictureIndex();
String shapename = hssfPicture.getShapeName().trim();
int row = hssfPicture.getClientAnchor().getRow1();
int col = hssfPicture.getClientAnchor().getCol1();
System.out.println("Picture " + picIndex + " with Shapename: " + shapename + " is located row: " + row + ", col: " + col);
if ("Image 2".equals(shapename)) hssfPictureToDelete = hssfPicture;
}
}
}
if (hssfPictureToDelete != null) deleteHSSFShape(hssfPictureToDelete);
FileOutputStream out = new FileOutputStream(filename);
workbook.write(out);
out.close();
workbook.close();
}
}
This code is tested using apache poi 4.0.1 and it needs the full jar of all of the schemas ooxml-schemas-1.4.jar as mentioned in FAQ N10025

Detecting text field overflow

Assuming I have a PDF document with a text field with some font and size defined, is there a way to determine if some text will fit inside the field rectangle using PDFBox?
I'm trying to avoid cases where text is not fully displayed inside the field, so in case the text overflows given the font and size, I would like to change the font size to Auto (0).
This code recreates the appearance stream to be sure that it exists so that there is a bbox (which can be a little bit smaller than the rectangle).
public static void main(String[] args) throws IOException
{
// file can be found at https://issues.apache.org/jira/browse/PDFBOX-142
// https://issues.apache.org/jira/secure/attachment/12742551/Testformular1.pdf
try (PDDocument doc = PDDocument.load(new File("Testformular1.pdf")))
{
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDTextField field = (PDTextField) acroForm.getField("Name");
PDAnnotationWidget widget = field.getWidgets().get(0);
// force generation of appearance stream
field.setValue(field.getValue());
PDRectangle rectangle = widget.getRectangle();
PDAppearanceEntry ap = widget.getAppearance().getNormalAppearance();
PDAppearanceStream appearanceStream = ap.getAppearanceStream();
PDRectangle bbox = appearanceStream.getBBox();
float fieldWidth = Math.min(bbox.getWidth(), rectangle.getWidth());
String defaultAppearance = field.getDefaultAppearance();
System.out.println(defaultAppearance);
// Pattern must be improved, font may have numbers
// /Helv 12 Tf 0 g
final Pattern p = Pattern.compile("\\/([A-z]+) (\\d+).+");
Matcher m = p.matcher(defaultAppearance);
if (!m.find() || m.groupCount() != 2)
{
System.out.println("oh-oh");
System.exit(-1);
}
String fontName = m.group(1);
int fontSize = Integer.parseInt(m.group(2));
PDResources resources = appearanceStream.getResources();
if (resources == null)
{
resources = acroForm.getDefaultResources();
}
PDFont font = resources.getFont(COSName.getPDFName(fontName));
float stringWidth = font.getStringWidth("Tilman Hausherr Tilman Hausherr");
System.out.println("stringWidth: " + stringWidth * fontSize / 1000);
System.out.println("field width: " + fieldWidth);
}
}
The output is:
/Helv 12 Tf 0 g
stringWidth: 180.7207
field width: 169.29082

Extract unselectable content from PDF

I'm using Apache PDFBox to extract pages from PDF files and I can't find a way to extract content that is unselectable (either text or images). With content that is selectable from within the PDF files there is no problem.
Note that the PDFs in question dont have any restrictions regarding copying content, at least from what I saw on the files's "Document Restrictions Summary": they all have "Content Copying" and "Content Copying for Accessbility" allowed! On the same PDF file there is content that is selectable and other parts that aren't. What happens is that, the extracted pages come with "holes", i.e., they only have the selectable parts of the PDF. On MS Word though, if I add the PDFs as objects, the whole content of the PDF pages appear! So I was hoping to do the same with PDFBox lib or any other Java lib for that matter!
Here is the code I'm using to convert PDF pages to images:
private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
for (PDPage pdPage : pdPages) {
BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + ".png", 300);
}
document.close();
}
Is there a way to extract unselectable content from an PDF with this Apache PDFBox library (or with any of the other similar libraries)? Or this is not possible at all? And if indeed it's not, why?
Much appreciated for any help!
EDIT: I'm using Adobe Reader as PDF viewer and PDFBox v1.8. Here is a sample PDF: https://dl.dropboxusercontent.com/u/2815529/test.pdf
The two images in question, the fischer logo in the upper right and the small sketch a bit down, are each drawn by filling a region on the page with a tiling pattern which in turn in its content stream draws the respective image.
Adobe Reader does not allow to select contents of patterns, and automatic image extractors often do not walk the Pattern resource tree either.
PDFBox 1.8.10
You can use PDFBox to fairly easily build a pattern image extractor, e.g. for PDFBox 1.8.10:
public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
List<PDPage> pages = document.getDocumentCatalog().getAllPages();
if (pages == null)
return;
for (int i = 0; i < pages.size(); i++)
{
String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
extractPatternImages(pages.get(i), pageFormat);
}
}
public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
PDResources resources = page.getResources();
if (resources == null)
return;
Map<String, PDPatternResources> patterns = resources.getPatterns();
for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
{
String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
extractPatternImages(patternEntry.getValue(), patternFormat);
}
}
public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException
{
COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
if (resourcesDict == null)
return;
PDResources resources = new PDResources(resourcesDict);
Map<String, PDXObject> xObjects = resources.getXObjects();
if (xObjects == null)
return;
for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
{
PDXObject xObject = entry.getValue();
String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
if (xObject instanceof PDXObjectForm)
extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
else if (xObject instanceof PDXObjectImage)
extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
}
}
public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException
{
PDResources resources = form.getResources();
if (resources == null)
return;
Map<String, PDXObject> xObjects = resources.getXObjects();
if (xObjects == null)
return;
for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
{
PDXObject xObject = entry.getValue();
String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
if (xObject instanceof PDXObjectForm)
extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
else if (xObject instanceof PDXObjectImage)
extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
}
Map<String, PDPatternResources> patterns = resources.getPatterns();
for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
{
String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
extractPatternImages(patternEntry.getValue(), patternFormat);
}
}
public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException
{
image.write2OutputStream(new FileOutputStream(String.format(imageFormat, "", image.getSuffix())));
}
(ExtractPatternImages.java)
I applied it to your sample PDF like this
public void testtestDrJorge() throws IOException
{
try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf"))
{
PDDocument document = PDDocument.load(resource);
extractPatternImages(document, "testDrJorge%s.%s");;
}
}
(ExtractPatternImages.java)
and got two images:
`testDrJorge-0-R15-R14.png
testDrJorge-0-R38-R37.png
The images have lost their red parts. This most likely is dues to the fact that PDFBox version 1.x.x do not properly support extraction of CMYK images, cf. PDFBOX-2128 (CMYK images are not supported correctly), and your images are in CMYK.
PDFBox 2.0.0 release candidate
I updated the code to PDFBox 2.0.0 (currently available as release candidate only):
public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
PDPageTree pages = document.getDocumentCatalog().getPages();
if (pages == null)
return;
for (int i = 0; i < pages.getCount(); i++)
{
String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
extractPatternImages(pages.get(i), pageFormat);
}
}
public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
PDResources resources = page.getResources();
if (resources == null)
return;
Iterable<COSName> patternNames = resources.getPatternNames();
for (COSName patternName : patternNames)
{
String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
extractPatternImages(resources.getPattern(patternName), patternFormat);
}
}
public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException
{
COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
if (resourcesDict == null)
return;
PDResources resources = new PDResources(resourcesDict);
Iterable<COSName> xObjectNames = resources.getXObjectNames();
if (xObjectNames == null)
return;
for (COSName xObjectName : xObjectNames)
{
PDXObject xObject = resources.getXObject(xObjectName);
String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
if (xObject instanceof PDFormXObject)
extractPatternImages((PDFormXObject)xObject, xObjectFormat);
else if (xObject instanceof PDImageXObject)
extractPatternImages((PDImageXObject)xObject, xObjectFormat);
}
}
public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException
{
PDResources resources = form.getResources();
if (resources == null)
return;
Iterable<COSName> xObjectNames = resources.getXObjectNames();
if (xObjectNames == null)
return;
for (COSName xObjectName : xObjectNames)
{
PDXObject xObject = resources.getXObject(xObjectName);
String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
if (xObject instanceof PDFormXObject)
extractPatternImages((PDFormXObject)xObject, xObjectFormat);
else if (xObject instanceof PDImageXObject)
extractPatternImages((PDImageXObject)xObject, xObjectFormat);
}
Iterable<COSName> patternNames = resources.getPatternNames();
for (COSName patternName : patternNames)
{
String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
extractPatternImages(resources.getPattern(patternName), patternFormat);
}
}
public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException
{
String filename = String.format(imageFormat, "", image.getSuffix());
ImageIOUtil.writeImage(image.getOpaqueImage(), "png", new FileOutputStream(filename));
}
and get
testDrJorge-0-COSName{R15}-COSName{R14}.png
testDrJorge-0-COSName{R38}-COSName{R37}.png
Looks like an improvement... ;)

How can I get Images coordinates in pdf into JSONfile?

I have coded creating html page included images extracting a page in pdf document.
I had tried to extract images from pdf and then I succeeded to extract images from pdf and to apply the images to html page using PDFBox lib. but I did not extract image coordinates in html page.
So searched how to extract image coordinates in pdf, I tried to extract image coordinates in pdf using PDFBox Library.
Below code :
public static void main(String[] args) throws Exception
{
try
{
PDDocument document = PDDocument.load(
"/Users/tmdtjq/Downloads/PDFTest/test.pdf" );
PrintImageLocations printer = new PrintImageLocations();
List allPages = document.getDocumentCatalog().getAllPages();
for( int i=0; i<allPages.size(); i++ )
{
PDPage page = (PDPage)allPages.get( i );
int pageNum = i+1;
System.out.println( "Processing page: " + pageNum );
printer.processStream( page, page.findResources(),
page.getContents().getStream() );
}
}
finally
{
}
}
protected void processOperator( PDFOperator operator, List arguments ) throws IOException
{
String operation = operator.getOperation();
if( operation.equals( "Do" ) )
{
COSName objectName = (COSName)arguments.get( 0 );
Map xobjects = getResources().getXObjects();
PDXObject xobject = xobjects.get( objectName.getName() );
if( xobject instanceof PDXObjectImage )
{
try
{
PDXObjectImage image = (PDXObjectImage)xobject;
PDPage page = getCurrentPage();
Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
double rotationInRadians =(page.findRotation() * Math.PI)/180;
AffineTransform rotation = new AffineTransform();
rotation.setToRotation( rotationInRadians );
AffineTransform rotationInverse = rotation.createInverse();
Matrix rotationInverseMatrix = new Matrix();
rotationInverseMatrix.setFromAffineTransform( rotationInverse );
Matrix rotationMatrix = new Matrix();
rotationMatrix.setFromAffineTransform( rotation );
Matrix unrotatedCTM = ctm.multiply( rotationInverseMatrix );
float xScale = unrotatedCTM.getXScale();
float yScale = unrotatedCTM.getYScale();
float xPosition = unrotatedCTM.getXPosition();
float yPosition = unrotatedCTM.getYPosition();
System.out.println( "Found image[" + objectName.getName() + "] " +
"at " + xPosition + "," + yPosition +
" size=" + (xScale/100f*image.getWidth()) + "," + (yScale/100f*image.getHeight() ));
}
catch( NoninvertibleTransformException e )
{
throw new WrappedIOException( e );
}
}
}
}
Outputs printing X,Y Positions in images is All 0.0, 0.0.
I think because getGraphicsState() is method to return the graphicsState.
But I want to get specific images coordinates applied to height,width of a PDF page in order to create html page.
I think maybe it is solution to extract JSON from images coordinates in PDF.
Please introduce image coordinates in PDF to JSON tool or suggest PDF Library.
(Already I used pdf2json tool in FlexPaper. this tool extracts JSONfile including not images data but just texts data(content, coordinates, font..) from PDF page.)
I was able to find images with searching for cm operator.
I overrided PDFTextStripper the following way:
Note: it doesn't take into account rotation and mirroring!
public static class TextFinder extends PDFTextStripper {
public TextFinder() throws IOException {
super();
}
#Override
protected void startPage(PDPage page) throws IOException {
// process start of the page
super.startPage(page);
}
#Override
public void process(PDFOperator operator, List<COSBase> arguments)
throws IOException {
if ("cm".equals(operator.getOperation())) {
float width = ((COSNumber)arguments.get(0)).floatValue();
float height = ((COSNumber)arguments.get(3)).floatValue();
float x = ((COSNumber)arguments.get(4)).floatValue();
float y = ((COSNumber)arguments.get(5)).floatValue();
// process image coordinates
}
super.processOperator(operator, arguments);
}
#Override
protected void writeString(String text,
List<TextPosition> textPositions) throws IOException {
for (TextPosition position : textPositions) {
// process text coordinates
}
super.writeString(text, textPositions);
}
}
Of course, one can use PDFStreamEngine instead of PDFTextStripper, if one is not interested in finding text together with images.

Categories