Extract text from a specific position in Java

I want to extract specific text from a PDF, and I have the exact position of the text.
I tried to use iText 7 for the extraction, but when I create the rectangle for the extraction with the correct dimensions, it seems too big to match the text, even though the dimensions are correct. I tried SimpleTextExtractionStrategy and
LocationTextExtractionStrategy, with the same result.
private void estraiValori(PdfPage page) {
for (Entry<String, Elemento> entry : templateMap.entrySet()) {
String key = entry.getKey();
Elemento value=(Elemento) entry.getValue();
//Rectangle tmp=new Rectangle((float)238.64,(float) 14.8,(float) 122,(float) 28.7);
TextRegionEventFilter fontFilter = new TextRegionEventFilter(value.getDim()); //getDim is a rectangle
FilteredEventListener listener = new FilteredEventListener();
//LocationTextExtractionStrategy extractionStrategy = listener.attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
SimpleTextExtractionStrategy extractionStrategy = listener.attachEventListener(new SimpleTextExtractionStrategy(), fontFilter);
new PdfCanvasProcessor(listener).processPageContent(page);//page is a PdfPage
String actualText = extractionStrategy.getResultantText();
System.out.println(actualText);
}
}

There are multiple ways to draw content that looks the same in a PDF. Text can be appended glyph by glyph, or in whole sentences. TextRegionEventFilter does not split bigger chunks of text into smaller ones before filtering. If the text was written as one big chunk and you want only a part of it, the raw content needs to be preprocessed, i.e. split into smaller chunks.
Fortunately, iText provides an out-of-the-box way to do that: the class is called GlyphTextEventListener, and it can be chained to other ITextExtractionStrategy instances. Just wrap your listener in an ITextExtractionStrategy in the following way:
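// Note: iText 7's Rectangle takes (x, y, width, height), not two corner points
TextRegionEventFilter filter = new TextRegionEventFilter(new Rectangle(x, y, width, height));
ITextExtractionStrategy filteredListener = new FilteredTextEventListener(new LocationTextExtractionStrategy(), filter);
// Split each text chunk into single glyphs before the region filter sees it
ITextExtractionStrategy fineGrainedListener = new GlyphTextEventListener(filteredListener);
new PdfCanvasProcessor(fineGrainedListener).processPageContent(page);
String actualText = fineGrainedListener.getResultantText();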
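To see why a correctly sized rectangle can still return too much text, here is a minimal sketch of the difference between filtering whole chunks and filtering single glyphs. It is a simplified one-dimensional model and does not use iText: the Chunk class, the method names, and the fixed glyph widths are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified one-dimensional model of region filtering. This is not iText code. */
class ChunkFilterSketch {

    /** Text drawn in one operation: a start position plus one width per character (hypothetical type). */
    static final class Chunk {
        final String text;
        final double startX;
        final double[] glyphWidths;

        Chunk(String text, double startX, double[] glyphWidths) {
            this.text = text;
            this.startX = startX;
            this.glyphWidths = glyphWidths;
        }

        double endX() {
            double x = startX;
            for (double w : glyphWidths) {
                x += w;
            }
            return x;
        }
    }

    /** True when the span [from, to] overlaps the region [regionFrom, regionTo]. */
    static boolean overlaps(double from, double to, double regionFrom, double regionTo) {
        return from < regionTo && to > regionFrom;
    }

    /** Chunk-level filtering: a chunk that touches the region is kept whole. */
    static String filterByChunk(List<Chunk> chunks, double regionFrom, double regionTo) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            if (overlaps(c.startX, c.endX(), regionFrom, regionTo)) {
                out.append(c.text);
            }
        }
        return out.toString();
    }

    /** Glyph-level filtering: every character is tested against the region on its own. */
    static String filterByGlyph(List<Chunk> chunks, double regionFrom, double regionTo) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            double x = c.startX;
            for (int i = 0; i < c.text.length(); i++) {
                double w = c.glyphWidths[i];
                if (overlaps(x, x + w, regionFrom, regionTo)) {
                    out.append(c.text.charAt(i));
                }
                x += w;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        double[] widths = new double[11];
        Arrays.fill(widths, 10.0); // assume every glyph is 10 units wide
        List<Chunk> chunks = Arrays.asList(new Chunk("Name: Alice", 0, widths));
        System.out.println(filterByChunk(chunks, 60, 110)); // prints "Name: Alice"
        System.out.println(filterByGlyph(chunks, 60, 110)); // prints "Alice"
    }
}
```

The region covers only the value, yet chunk-level filtering returns the label as well, because both were drawn as one chunk. Splitting into glyphs first gives the region filter something small enough to cut.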

Related

How to use iText to parse paths (such as lines in the document)

I am using iText to parse text in a PDF document, and I am using PdfContentStreamProcessor with a RenderListener, like this:
PdfReader reader = new PdfReader(file.toURI().toURL());
int numberOfPages = reader.getNumberOfPages();
MyRenderListener listener = new MyRenderListener ();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++) {
PdfDictionary pageDic = reader.getPageN(pageNumber);
PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
Rectangle pageSize = reader.getPageSize(pageNumber);
listener.startPage(pageNumber, pageSize);
processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNumber), resourcesDic);
}
I have no problem getting the text with the renderText(TextRenderInfo) method, but how do I parse the graphic content apart from images? For example, in my case I would like to get:
Text content which is in a box
Horizontal lines
Following mkl's comment, I was able to get the geometries by using ExtRenderListener. I used the question "How to extract the color of a rectangle in a PDF, with iText" for reference.
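Once the path geometry has been collected, picking out the horizontal lines is plain arithmetic. The following is a minimal sketch under the assumption that the listener has already stored each straight segment as two end points. The Segment class, the tolerance value, and the method names are hypothetical and are not part of iText.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Classifies collected path segments. The Segment type is a stand-in, not an iText class. */
class HorizontalLineSketch {

    /** A straight segment in page coordinates. */
    static final class Segment {
        final float x1, y1, x2, y2;

        Segment(float x1, float y1, float x2, float y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }

        /** Horizontal means: almost no change in y, but a real extent in x. */
        boolean isHorizontal(float tolerance) {
            return Math.abs(y1 - y2) <= tolerance && Math.abs(x1 - x2) > tolerance;
        }
    }

    /** Keeps only the segments that are horizontal within the given tolerance. */
    static List<Segment> horizontalOnly(List<Segment> segments, float tolerance) {
        List<Segment> result = new ArrayList<>();
        for (Segment s : segments) {
            if (s.isHorizontal(tolerance)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment(0, 100, 200, 100),    // horizontal
                new Segment(50, 0, 50, 300),      // vertical
                new Segment(0, 0, 100, 100),      // diagonal
                new Segment(10, 50, 300, 50.2f)); // nearly horizontal
        System.out.println(horizontalOnly(segments, 0.5f).size()); // prints 2
    }
}
```

Note that many PDF producers draw rules as thin filled rectangles rather than stroked lines, so a real listener may need to treat very flat rectangles the same way.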

How to Maintain Aspect Ratio when Adding an Image to a PDFButtonFormField using iText7

I am collecting images from users with a form field on a PDF form. When the field is empty, the user can populate the field in Acrobat and I can successfully read it from the form using iText 7. If the user has previously uploaded an image, I want to present them with that image already loaded into the form field and allow them to select and submit a different image. iText allows me to populate the form with the image, but it distorts the image's aspect ratio by resizing it to the dimensions of the form field's rectangle.
Is there a way to get iText's setImage() method to maintain the aspect ratio when loading the image?
I have also tried using the following code to modify the form field's rectangle to conform to the image aspect ratio before loading the image:
PdfReader reader = new PdfReader("TestForm.pdf");
ByteArrayOutputStream os = new ByteArrayOutputStream();
PdfWriter writer = new PdfWriter(os);
StampingProperties properties = new StampingProperties();
properties.useAppendMode();
PdfDocument document = new PdfDocument(reader, writer, properties);
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);
acroForm.setNeedAppearances(true);
// get button form field
String fieldName = "Image1_af_image";
PdfButtonFormField field = (PdfButtonFormField)acroForm.getField(fieldName);
// retrieve widget rectangle
PdfDictionary widgets = field.getWidgets().get(0).getPdfObject();
com.itextpdf.kernel.geom.Rectangle rect = widgets.getAsRectangle(PdfName.Rect);
// modify its width
field.setImage("/Users/sschultz/Desktop/zuni logo.jpg").setFieldName(fieldName);
document.close();
os.flush();
os.close();
FileUtils.writeByteArrayToFile(new File("TestForm_out.pdf"), os.toByteArray());
but this code fails to modify the form field's original dimensions.
Finally, I have attempted to add a second, new form field with the appropriate aspect ratio:
// add second button field to form
String fieldName2 = "Image2_af_image";
PdfButtonFormField imageField = PdfFormField.createButton(document, new Rectangle(10, 10, 200, 50),
PdfButtonFormField.FF_PUSH_BUTTON);
imageField.setImage("image2.jpg").setFieldName(fieldName2);
acroForm.addField(imageField);
but the second field never appears in the form.
The default drawing of a button field's appearance, which doesn't preserve the image's aspect ratio, is quite complicated to override. Instead, I suggest generating the appearance manually, making it exactly as you want it.
// get widget dictionary
List<PdfWidgetAnnotation> widgets = pushButtonField.getWidgets();
PdfDictionary widgetDict;
if (widgets.size() > 0) {
widgetDict = widgets.get(0).getPdfObject();
} else {
// widgets.size() == 0 shouldn't really happen to properly created
// existing fields, but let's do it just in case
widgetDict = pushButtonField.getPdfObject();
}
Rectangle origRect = widgetDict.getAsRectangle(PdfName.Rect);
float borderWidth = pushButtonField.getBorderWidth();
String imgPath = ... // path to image file
// draw custom appearance preserving original field sizes
PdfFormXObject pdfFormXObject = new PdfFormXObject(new Rectangle(origRect.getWidth() - borderWidth * 2, origRect.getHeight() - borderWidth * 2));
Canvas canvas = new Canvas(pdfFormXObject, pdfDoc);
// Image class preserves aspect ratio by default
Image image = new Image(ImageDataFactory.create(imgPath))
.setAutoScale(true)
.setHorizontalAlignment(HorizontalAlignment.CENTER);
Div container = new Div()
.setMargin(borderWidth).add(image)
.setVerticalAlignment(VerticalAlignment.MIDDLE)
.setFillAvailableArea(true);
canvas.add(container);
canvas.close();
// override original appearance with new one
PdfDictionary apDict = new PdfDictionary();
widgetDict.put(PdfName.AP, apDict);
apDict.put(PdfName.N, pdfFormXObject.getPdfObject());
// mark widgetDict as modified in order to save its changes in append mode
widgetDict.setModified();
The last line (calling setModified()) is required for every low-level PdfObject modification if you are using properties.useAppendMode(). Otherwise the change will not be saved.
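For reference, the scaling that an aspect-preserving layout has to perform can be written out by hand. This is a minimal sketch of that arithmetic only, independent of iText. The class and method names are hypothetical, and the box is assumed to be the field rectangle with the border already subtracted.

```java
/** Aspect-preserving "fit inside a box" arithmetic; does not use iText. */
class AspectFitSketch {

    /**
     * Returns {x, y, width, height} of the largest rectangle that has the image's
     * aspect ratio, fits inside a box of boxW x boxH, and is centered in that box.
     */
    static float[] fit(float imgW, float imgH, float boxW, float boxH) {
        if (imgW <= 0 || imgH <= 0 || boxW <= 0 || boxH <= 0) {
            throw new IllegalArgumentException("all dimensions must be positive");
        }
        // The smaller of the two ratios is the one that keeps both sides inside the box.
        float scale = Math.min(boxW / imgW, boxH / imgH);
        float w = imgW * scale;
        float h = imgH * scale;
        return new float[] {(boxW - w) / 2f, (boxH - h) / 2f, w, h};
    }

    public static void main(String[] args) {
        // A 400 x 200 image placed into a 200 x 50 field
        float[] r = fit(400, 200, 200, 50);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]); // prints 50.0, 0.0, 100.0, 50.0
    }
}
```

Stretching the same image to the full 200 x 50 field would change its width-to-height ratio from 2:1 to 4:1, which is the distortion described in the question.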

Add footer text and page number in same line on docx file with docx4j java

I am trying to add a small text (left side) and a page number (right side) to the footer of a .docx document, on the same line.
So far I can add the text and the page number, but on two lines:
TextVersionv02312
1
But I need it
TextVersionv02312 1
The code that I am using to add text and page number is:
private static Ftr createFooter(WordprocessingMLPackage wordMLPackage, String content, ObjectFactory factory, Part sourcePart, InputStream is) throws IOException, Throwable {
Ftr footer = factory.createFtr();
P paragraph = factory.createP();
R run = factory.createR();
/*
* Change the font size to 8 points(the font size is defined to be in half-point
* size so set the value as 16).
*/
RPr rpr = new RPr();
HpsMeasure size = new HpsMeasure();
size.setVal(BigInteger.valueOf(16));
rpr.setSz(size);
run.setRPr(rpr);
Text text = new Text();
text.setValue(content);
run.getContent().add(text);
paragraph.getContent().add(run);
footer.getContent().add(paragraph);
// add page number
P pageNumParagraph = factory.createP();
addFieldBegin(factory, pageNumParagraph);
addPageNumberField(factory, pageNumParagraph);
addFieldEnd(factory, pageNumParagraph);
footer.getContent().add(pageNumParagraph);
return footer;
}
private static void addPageNumberField(ObjectFactory factory, P paragraph) {
R run = factory.createR();
PPr ppr = new PPr();
Jc jc = new Jc();
jc.setVal(JcEnumeration.RIGHT);
ppr.setJc(jc);
paragraph.setPPr(ppr);
Text txt = new Text();
txt.setSpace("preserve");
txt.setValue(" PAGE \\* MERGEFORMAT ");
run.getContent().add(factory.createRInstrText(txt));
paragraph.getContent().add(run);
}
I have been thinking of adding a table or something like that to the footer to put the elements on the same line, but it seems that I am overcomplicating things.
Or maybe I can append the page number to the text paragraph.
What do you think?
Thanks in advance!
You can do it any way you can in Word, for example, with tab stops. (Or as you say, tables, but I'd use say a centre aligned then a right aligned tab if you wanted centre then right)
Easiest is to get it right in Word, then use the docx4j webapp or Docx4j Helper Word AddIn, to generate corresponding Java code from that sample document.
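As an illustration of the tab-stop approach, the sketch below builds the WordprocessingML for a single footer paragraph: a right-aligned tab stop at the edge of the text area, the text, a tab character, then the PAGE field. It only produces an XML string and does not call docx4j. The class and method names are hypothetical, and the A4 page width and one-inch margins in the example are assumptions.

```java
/** Builds the XML of a one-line footer paragraph; plain string handling, no docx4j calls. */
class FooterTabStopSketch {

    /** Width of the text area in twips (1/20 pt), which is where the right tab stop belongs. */
    static int rightTabPosition(int pageWidthTwips, int leftMarginTwips, int rightMarginTwips) {
        return pageWidthTwips - leftMarginTwips - rightMarginTwips;
    }

    /** Escapes the characters that are not allowed in XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** One paragraph: text on the left, a tab, and the page number at the right tab stop. */
    static String footerParagraphXml(String text, int tabPos) {
        return "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
                + "<w:pPr><w:tabs><w:tab w:val=\"right\" w:pos=\"" + tabPos + "\"/></w:tabs></w:pPr>"
                + "<w:r><w:t>" + escapeXml(text) + "</w:t></w:r>"
                + "<w:r><w:tab/></w:r>"
                + "<w:r><w:fldChar w:fldCharType=\"begin\"/></w:r>"
                + "<w:r><w:instrText xml:space=\"preserve\"> PAGE \\* MERGEFORMAT </w:instrText></w:r>"
                + "<w:r><w:fldChar w:fldCharType=\"end\"/></w:r>"
                + "</w:p>";
    }

    public static void main(String[] args) {
        // Assumed layout: A4 page (11906 twips wide) with 1-inch (1440 twips) margins
        int tabPos = rightTabPosition(11906, 1440, 1440);
        System.out.println(tabPos); // prints 9026
        System.out.println(footerParagraphXml("TextVersionv02312", tabPos));
    }
}
```

The key difference from the code in the question is that everything lives in one paragraph, so no right-justified second paragraph is needed. docx4j's XmlUtils.unmarshalString should be able to turn such a string into a P object, but treat that as something to verify against the docx4j version in use.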

Is it possible to mark a string in pdf document?

I was wondering if it is possible to mark strings in a PDF with a different color, or underline them, while looping through the PDF document.
It's possible when creating a document. Just use different chunks to set the style. Here's an example:
Document document = new Document();
PdfWriter.getInstance(document, outputStream);
document.open();
document.add(new Chunk("This word is "));
Chunk underlined = new Chunk("underlined");
underlined.setUnderline(1.0f, -1.0f); //We can customize thickness and position of underline
document.add(underlined);
document.add(new Chunk(". And this phrase has "));
Chunk background = new Chunk("yellow background.");
background.setBackground(BaseColor.YELLOW);
document.add(background);
document.add(Chunk.NEWLINE);
document.close();
However, it's almost impossible to edit an existing PDF document. The author of iText writes in his book:
In a PDF document, every character or glyph on a PDF page has its
fixed position, regardless of the application that’s used to view the
document. This is an advantage, but it also comes with a disadvantage.
Suppose you want to replace the word “edit” with the word “manipulate”
in a sentence, you’d have to reflow the text. You’d have to reposition
all the characters that follow that word. Maybe you’d even have to
move a portion of the text to the next page. That’s not trivial, if
not impossible.
If you want to “edit” a PDF, it’s advised that you change the original
source of the document and remake the PDF.
Aspose.PDF APIs support creating new PDF documents and manipulating existing PDF documents without a dependency on Adobe Acrobat. You can search for text and add a Highlight Annotation to mark it.
REST API Solution using Aspose.PDF Cloud SDK for Java:
// For complete examples and data files, please go to https://github.com/aspose-pdf-cloud/aspose-pdf-cloud-java
String name = "02_pages.pdf";
String folder="Temp";
String remotePath=folder+"/"+name;
// File to upload
File file = new File("C:/Temp/"+name);
// Storage name is default storage
String storage = null;
// Get App Key and App SID from https://dashboard.aspose.cloud/
PdfApi pdfApi = new PdfApi("xxxxxxxxxxxxxxxxxxxx", "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxx");
//Upload file to cloud storage
pdfApi.uploadFile(remotePath,file,storage);
//Text Position
Rectangle rect = new Rectangle().LLX(259.27580539703365).LLY(743.4707997894287).URX(332.26148873138425).URY(765.5148007965088);
List<AnnotationFlags> flags = new ArrayList<>();
flags.add(AnnotationFlags.DEFAULT);
HighlightAnnotation annotation = new HighlightAnnotation();
annotation.setName("Name Updated");
annotation.rect(rect);
annotation.setFlags(flags);
annotation.setHorizontalAlignment(HorizontalAlignment.CENTER);
annotation.setRichText("Rich Text Updated");
annotation.setSubject("Subj Updated");
annotation.setPageIndex(1);
annotation.setZindex(1);
annotation.setTitle("Title Updated");
annotation.setModified("02/02/2018 00:00:00.000 AM");
List<HighlightAnnotation> annotations = new ArrayList<>();
annotations.add(annotation);
//Add Highlight Annotation to the PDF document
AsposeResponse response = pdfApi.postPageHighlightAnnotations(name,1, annotations, storage, folder);
//Download annotated PDF file from Cloud Storage
File downloadResponse = pdfApi.downloadFile(remotePath, null, null);
File dest = new File("C:/Temp/HighlightAnnotation.pdf");
Files.copy(downloadResponse.toPath(), dest.toPath(), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
System.out.println("Completed......");
On-Premise Solution using Aspose.PDF for Java:
// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.Pdf-for-Java
// Instantiate Document object
Document document = new Document("C:/Temp/Test.pdf");
// Create TextFragment Absorber instance to search particular text fragment
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Estoque");
// Iterate through pages of PDF document
for (int i = 1; i <= document.getPages().size(); i++) {
// Get the i-th page of the PDF document (page indices start at 1)
Page page = document.getPages().get_Item(i);
page.accept(textFragmentAbsorber);
}
// Create a collection of Absorbed text
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// Iterate on above collection
for (int j = 1; j <= textFragmentCollection.size(); j++) {
TextFragment textFragment = textFragmentCollection.get_Item(j);
// Get rectangular dimensions of TextFragment object
Rectangle rect = new Rectangle((float) textFragment.getPosition().getXIndent(), (float) textFragment.getPosition().getYIndent(), (float) textFragment.getPosition().getXIndent() + (float) textFragment.getRectangle().getWidth(), (float) textFragment.getPosition().getYIndent() + (float) textFragment.getRectangle().getHeight());
// Instantiate HighLight Annotation instance
HighlightAnnotation highLight = new HighlightAnnotation(textFragment.getPage(), rect);
// Set opacity for annotation
highLight.setOpacity(.80);
// Set the color of annotation
highLight.setColor(Color.getYellow());
// Add annotation to annotations collection of TextFragment
textFragment.getPage().getAnnotations().add(highLight);
}
// Save updated document
document.save("C:/Temp/HighLight.pdf");
P.S: I work as support/evangelist developer at Aspose.

iText reading multicolumned PDF document

Reading multicolumned PDF document
When iText reads the PDF (extracting a page's content into a string variable), the content is cleaned up as follows:
reader = new PdfReader(getResources().openRawResource(R.raw.resume1));
original_content = PdfTextExtractor.getTextFromPage(reader, 2);
String sub_content = original_content.trim().replaceAll(" {2,}", " ");
sub_content = sub_content.trim().replaceAll("\n ", "\n");
sub_content = sub_content.replaceAll("(.+)(?<!\\.)\n(?!\\W)", "$1 ");
This works if the document has only one column. If the document has multiple columns, however, the text is extracted one line at a time across the whole page, so the left and right columns are combined.
I am using a page from the START QA document as a sample PDF.
How to read a multicolumned PDF document?
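For context, the clean-up step from the question can be tried on a plain string without any PDF involved. This is a minimal sketch that wraps the same three replacements in a method. The class name, method name, and sample text are made up for illustration.

```java
/** The three clean-up replacements from the question, applied to an ordinary string. */
class ExtractedTextCleanup {

    static String clean(String raw) {
        // 1. Collapse runs of two or more spaces into one space
        String s = raw.trim().replaceAll(" {2,}", " ");
        // 2. Drop a single space that directly follows a line break
        s = s.trim().replaceAll("\n ", "\n");
        // 3. Join a line with the next one unless it ends with a full stop
        //    or the next line starts with a non-word character
        return s.replaceAll("(.+)(?<!\\.)\n(?!\\W)", "$1 ");
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello   world\n this is\na test.\nNext"));
        // prints:
        // Hello world this is a test.
        // Next
    }
}
```

Because the joining happens purely on line breaks, it cannot tell which column a line fragment came from. Text from two columns that were extracted into the same line stays mixed, which is why the fix has to happen at extraction time.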
There are two different approaches to this problem, and the choice which to use depends on the PDF itself.
If the strings in the page content of the PDF in question are already in the desired order: instead of the LocationTextExtractionStrategy implicitly used by the overload of PdfTextExtractor.getTextFromPage that you call, explicitly use the SimpleTextExtractionStrategy; in your case:
original_content = PdfTextExtractor.getTextFromPage(reader, 2, new SimpleTextExtractionStrategy());
If the strings in the page content of the PDF in question are not in the desired order: Instead of the LocationTextExtractionStrategy implicitly used by the overload of PdfTextExtractor.getTextFromPage you use, explicitly wrap one such strategy in a FilteredTextRenderListener restricting it to receive text for the area of a single column only; in your case:
Rectangle left = new Rectangle(0, 0, 306, 792);
Rectangle right = new Rectangle(306, 0, 612, 792);
RenderFilter leftFilter = new RegionTextRenderFilter(left);
RenderFilter rightFilter = new RegionTextRenderFilter(right);
[...]
TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), leftFilter);
original_content = PdfTextExtractor.getTextFromPage(reader, 2, strategy);
original_content += " ";
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), rightFilter);
original_content += PdfTextExtractor.getTextFromPage(reader, 2, strategy);
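The two rectangles above are hard-coded for a two-column US Letter page (612 x 792 points). As a minimal sketch, the same regions can be computed for any page size and column count, assuming equally wide columns with no gutter. The class and method names are hypothetical. The arrays use the corner form (llx, lly, urx, ury) that the iText 5 Rectangle constructor takes, and a second helper converts them to the (x, y, width, height) form used by iText 7.

```java
/** Computes per-column regions as plain float arrays; does not use iText. */
class ColumnRegionSketch {

    /** Returns one {llx, lly, urx, ury} array per column, ordered left to right. */
    static float[][] columnRegions(float pageWidth, float pageHeight, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be at least 1");
        }
        float[][] regions = new float[columns][];
        float columnWidth = pageWidth / columns;
        for (int i = 0; i < columns; i++) {
            regions[i] = new float[] {i * columnWidth, 0f, (i + 1) * columnWidth, pageHeight};
        }
        return regions;
    }

    /** Converts {llx, lly, urx, ury} into {x, y, width, height}. */
    static float[] toXYWidthHeight(float[] corners) {
        return new float[] {corners[0], corners[1], corners[2] - corners[0], corners[3] - corners[1]};
    }

    public static void main(String[] args) {
        float[][] regions = columnRegions(612, 792, 2);
        for (float[] r : regions) {
            System.out.println(java.util.Arrays.toString(r));
        }
        // prints:
        // [0.0, 0.0, 306.0, 792.0]
        // [306.0, 0.0, 612.0, 792.0]
    }
}
```

Each region would then be wrapped in its own filter and the page processed once per column, appending the results in reading order as in the snippet above.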
