iText reading multicolumned PDF document

Reading multicolumned PDF document
When iText reads the PDF (extracting a page's content into a string variable), the content is then cleaned up as follows:
reader = new PdfReader(getResources().openRawResource(R.raw.resume1));
original_content = PdfTextExtractor.getTextFromPage(reader, 2);
String sub_content = original_content.trim().replaceAll(" {2,}", " ");
sub_content = sub_content.trim().replaceAll("\n ", "\n");
sub_content = sub_content.replaceAll("(.+)(?<!\\.)\n(?!\\W)", "$1 ");
This works if the document has only one column. If the document has multiple columns, however, the text is extracted line by line across the whole page, so each line combines the left and right columns.
I am using a page from the START QA document as a sample PDF.
How can I read a multicolumn PDF document?
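For reference, the three clean-up replacements from the question behave as follows on single-column text. This is a minimal sketch that needs no iText; the class name TextCleanupDemo and the sample string are made up for illustration.

```java
class TextCleanupDemo {
    // Applies the three replacements from the question, in the same order.
    static String clean(String raw) {
        // Collapse runs of two or more spaces into a single space.
        String s = raw.trim().replaceAll(" {2,}", " ");
        // Drop a single space that directly follows a line break.
        s = s.trim().replaceAll("\n ", "\n");
        // Join a line with the next one unless it ends in '.' or the next
        // line starts with a non-word character (for example a bullet).
        return s.replaceAll("(.+)(?<!\\.)\n(?!\\W)", "$1 ");
    }

    public static void main(String[] args) {
        String raw = "First  line of a\n paragraph that wraps.\nSecond paragraph";
        // Prints two lines:
        // First line of a paragraph that wraps.
        // Second paragraph
        System.out.println(clean(raw));
    }
}
```

The last replacement is what breaks on multicolumn pages: once left and right column text share a line, joining lines only glues the mixed-up text together.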

There are two different approaches to this problem, and the choice which to use depends on the PDF itself.
If the strings in the page content of the PDF in question are already in the desired order: Instead of the LocationTextExtractionStrategy implicitly used by the overload of PdfTextExtractor.getTextFromPage you use, explicitly use the SimpleTextExtractionStrategy; in your case:
original_content = PdfTextExtractor.getTextFromPage(reader, 2, new SimpleTextExtractionStrategy());
If the strings in the page content of the PDF in question are not in the desired order: Instead of the LocationTextExtractionStrategy implicitly used by the overload of PdfTextExtractor.getTextFromPage you use, explicitly wrap one such strategy in a FilteredTextRenderListener restricting it to receive text for the area of a single column only; in your case:
Rectangle left = new Rectangle(0, 0, 306, 792);
Rectangle right = new Rectangle(306, 0, 612, 792);
RenderFilter leftFilter = new RegionTextRenderFilter(left);
RenderFilter rightFilter = new RegionTextRenderFilter(right);
[...]
TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), leftFilter);
original_content = PdfTextExtractor.getTextFromPage(reader, 2, strategy);
original_content += " ";
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), rightFilter);
original_content += PdfTextExtractor.getTextFromPage(reader, 2, strategy);
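The rectangles above hard-code two equal columns on a 612 x 792 point page. As a rough sketch of how the same idea might extend to other page sizes or column counts, the helper below only does the arithmetic and returns {llx, lly, urx, ury} values, which is the argument order of the iText 5 Rectangle constructor used above. The class name ColumnBounds and the method split are hypothetical, and the sketch assumes equal-width columns with no gutter, which a real layout may not have.

```java
import java.util.Arrays;

class ColumnBounds {
    // Returns {llx, lly, urx, ury} for each of `columns` equal-width columns.
    static float[][] split(float pageWidth, float pageHeight, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        float[][] bounds = new float[columns][];
        float columnWidth = pageWidth / columns;
        for (int i = 0; i < columns; i++) {
            bounds[i] = new float[] {i * columnWidth, 0f, (i + 1) * columnWidth, pageHeight};
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Same page size as in the answer above.
        for (float[] b : split(612f, 792f, 2)) {
            System.out.println(Arrays.toString(b));
        }
    }
}
```

Each returned array could then be passed to a Rectangle and wrapped in a RegionTextRenderFilter exactly as in the answer, extracting the columns one after another.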

Related

How to use iText to parse paths (such as lines in the document)

I am using iText to parse text in a PDF document, and I am using PdfContentStreamProcessor with a RenderListener. For example:
PdfReader reader = new PdfReader(file.toURI().toURL());
int numberOfPages = reader.getNumberOfPages();
MyRenderListener listener = new MyRenderListener ();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++) {
PdfDictionary pageDic = reader.getPageN(pageNumber);
PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
Rectangle pageSize = reader.getPageSize(pageNumber);
listener.startPage(pageNumber, pageSize);
processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNumber), resourcesDic);
}
I have no problem getting the text with the renderText(TextRenderInfo) method, but how do I parse the graphic content apart from images? For example, in my case I would like to get:
Text content which is in a box
Horizontal lines

Per mkl's comment, I was able to get the geometries by using ExtRenderListener. I used the question "How to extract the color of a rectangle in a PDF, with iText" for reference.

PDF Cut Vertically and Merge

I have a PDF of a scanned booklet. I want to split each scanned page in half vertically and then re-paginate the result.
For example, the scanned booklet pages would be 1, 8 and 7, 2, etc.
After processing, I want to have a PDF with pages 1, 2, 3, 4, ...
Please advise which PDF library can do this in Java.
Thanks

Depending on how the booklet was scanned into the PDF, I think you might be able to do this using a Java library that can extract and merge PDF pages.
For example, in the LEADTOOLS Java PDF Library, which is what I am familiar with since I work for the vendor, there is a PDFFile class that can be used to extract and merge pages from and to a PDF file.
PDFFile file = new PDFFile(bookletFile);
int pageCount = file.getPageCount();
for (int i = 1; i <= pageCount; i++)
{
File destinationFile = new File(destinationFolder, String.format("Extracted_Page%d.pdf", i));
file.extractPages(i, i, destinationFile.getPath());
}
The booklet looks like it was scanned in such a way that every other page contains a double page. To split them, you can load every other extracted page as a raster image, then use the library's raster imaging classes to save each half as a separate raster PDF:
RasterCodecs codecs = new RasterCodecs();
RasterImage firstHalfImage = codecs.load(extractedDoublePage);
// Create a LeadRect that encompasses the second half
LeadRect secondHalfLeadRect = new LeadRect(firstHalfImage.getImageWidth() / 2, 0, firstHalfImage.getImageWidth() / 2, firstHalfImage.getImageHeight());
// Create a new image containing the second half
RasterImage secondHalfImage = firstHalfImage.clone(secondHalfLeadRect);
// Crop First Image to contain only first half
LeadRect firstHalfLeadRect = new LeadRect(0, 0, firstHalfImage.getImageWidth() / 2, firstHalfImage.getImageHeight());
CropCommand cropCommand = new CropCommand(firstHalfLeadRect);
cropCommand.run(firstHalfImage);
You can then use the RasterCodecs.Save() method to save each image as a raster PDF file.
Finally, once you have split everything accordingly, you can use the PDFFile.MergeWith() method to combine all the pages back into one file in the needed order.
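The "needed order" for the final merge depends on how the booklet was imposed. As a hedged sketch, the helper below assumes a standard saddle-stitched layout in which every scanned side is a double page: side 0 shows pages N and 1, side 1 shows pages 2 and N-1, and so on. It is independent of any PDF library; the class name BookletOrder and both method names are hypothetical. If the scan uses a different layout (the question's "1, 8 and 7, 2" may well have left and right swapped), the left/right choice has to be adapted.

```java
import java.util.Arrays;

class BookletOrder {
    // Page numbers of the half-pages in scan order: left half of side 0,
    // right half of side 0, left half of side 1, and so on.
    static int[] scanOrder(int totalPages) {
        if (totalPages <= 0 || totalPages % 4 != 0) {
            throw new IllegalArgumentException("page count must be a positive multiple of 4");
        }
        int[] pages = new int[totalPages];
        for (int side = 0; side < totalPages / 2; side++) {
            int low = side + 1;
            int high = totalPages - side;
            // Assumption: even sides carry the higher page number on the left.
            boolean highOnLeft = side % 2 == 0;
            pages[2 * side] = highOnLeft ? high : low;
            pages[2 * side + 1] = highOnLeft ? low : high;
        }
        return pages;
    }

    // For each reading page 1..N, the index (in scan order) of the half-page holding it.
    static int[] readingOrder(int totalPages) {
        int[] scanned = scanOrder(totalPages);
        int[] order = new int[totalPages];
        for (int i = 0; i < scanned.length; i++) {
            order[scanned[i] - 1] = i;
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(scanOrder(8)));    // [8, 1, 2, 7, 6, 3, 4, 5]
        System.out.println(Arrays.toString(readingOrder(8))); // [1, 2, 5, 6, 7, 4, 3, 0]
    }
}
```

Merging the split half-pages in the sequence given by readingOrder would yield pages 1, 2, 3, ... in the output file.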

Extract text from specific position in java

I want to extract specific text from a PDF, and I have the exact position of the text.
I tried to use iText 7 for the extraction, but when I create the rectangle for the extraction with the correct dimensions, it seems too big to match the text, even though the dimensions are correct. I tried SimpleTextExtractionStrategy and LocationTextExtractionStrategy, with the same result.
private void estraiValori(PdfPage page) {
for (Entry<String, Elemento> entry : templateMap.entrySet()) {
String key = entry.getKey();
Elemento value=(Elemento) entry.getValue();
//Rectangle tmp=new Rectangle((float)238.64,(float) 14.8,(float) 122,(float) 28.7);
TextRegionEventFilter fontFilter = new TextRegionEventFilter(value.getDim()); //getDim is a rectangle
FilteredEventListener listener = new FilteredEventListener();
//LocationTextExtractionStrategy extractionStrategy = listener.attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
SimpleTextExtractionStrategy extractionStrategy = listener.attachEventListener(new SimpleTextExtractionStrategy(), fontFilter);
new PdfCanvasProcessor(listener).processPageContent(page);//page is a PdfPage
String actualText = extractionStrategy.getResultantText();
System.out.println(actualText);
}
}

There are multiple ways to produce the same visual content in a PDF. You can append text glyph by glyph, or in whole sentences. TextRegionEventFilter does not split bigger chunks of text into smaller ones before filtering. If text was written as one big chunk and you want only a part of it, the raw content needs to be preprocessed, i.e. split into smaller chunks.
Fortunately, iText provides an out-of-the-box way to do that: the class is called GlyphTextEventListener, and it can be chained to other ITextExtractionStrategy instances. Just wrap your listener in an ITextExtractionStrategy in the following way:
TextRegionEventFilter filter = new TextRegionEventFilter(new Rectangle(x, y, width, height)); // iText 7 Rectangle takes x, y, width, height
ITextExtractionStrategy filteredListener = new FilteredTextEventListener(new LocationTextExtractionStrategy(), filter);
ITextExtractionStrategy fineGrainedListener = new GlyphTextEventListener(filteredListener);
new PdfCanvasProcessor(fineGrainedListener).processPageContent(page);
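To illustrate why the unsplit chunks make the region look "too big", here is a toy model of the filtering step. It is not iText code and does not reproduce iText's actual acceptance test: it is one-dimensional, assumes a fixed glyph width, and all names (ChunkFilterDemo, Chunk, extract, splitIntoGlyphs) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ChunkFilterDemo {
    // One piece of text as a listener would receive it (toy model).
    static final class Chunk {
        final String text;
        final double x;          // start of the chunk on the baseline
        final double glyphWidth; // assumed identical for every glyph

        Chunk(String text, double x, double glyphWidth) {
            this.text = text;
            this.x = x;
            this.glyphWidth = glyphWidth;
        }

        double endX() {
            return x + text.length() * glyphWidth;
        }
    }

    // Splits every chunk into one chunk per character.
    static List<Chunk> splitIntoGlyphs(List<Chunk> chunks) {
        List<Chunk> glyphs = new ArrayList<>();
        for (Chunk c : chunks) {
            for (int i = 0; i < c.text.length(); i++) {
                glyphs.add(new Chunk(String.valueOf(c.text.charAt(i)), c.x + i * c.glyphWidth, c.glyphWidth));
            }
        }
        return glyphs;
    }

    // Keeps the full text of every chunk that overlaps the range between left and right.
    static String extract(List<Chunk> chunks, double left, double right) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (c.x < right && c.endX() > left) {
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Collections.singletonList(new Chunk("Invoice 12345", 0, 10));
        System.out.println(extract(chunks, 80, 130));                  // Invoice 12345
        System.out.println(extract(splitIntoGlyphs(chunks), 80, 130)); // 12345
    }
}
```

In the first call the whole chunk passes the filter because part of it lies in the region; after splitting, only the glyphs inside the region survive, which is the effect the glyph-level listener aims for.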

Creating a PDF/A-3 with form fields using iText 7 results in PdfAConformanceException

I want to use iText 7 (7.0.7 actually) to create a PDF/A-3 file with form fields.
I checked the examples and the jump tutorial to do so.
After adding a form field to one of the examples, as shown in the modified code below, I get the following error:
Exception in thread "main" com.itextpdf.pdfa.PdfAConformanceException: An annotation dictionary shall contain the f key
at com.itextpdf.pdfa.checker.PdfA2Checker.checkAnnotation(PdfA2Checker.java:336)
at com.itextpdf.pdfa.checker.PdfAChecker.checkAnnotations(PdfAChecker.java:467)
at com.itextpdf.pdfa.checker.PdfAChecker.checkPage(PdfAChecker.java:446)
at com.itextpdf.pdfa.checker.PdfAChecker.checkPages(PdfAChecker.java:434)
at com.itextpdf.pdfa.checker.PdfAChecker.checkDocument(PdfAChecker.java:182)
at com.itextpdf.pdfa.PdfADocument.checkIsoConformance(PdfADocument.java:296)
at com.itextpdf.kernel.pdf.PdfDocument.close(PdfDocument.java:742)
at com.itextpdf.layout.Document.close(Document.java:120)
at pdfatest.C07E03_UnitedStates_PDFA_3a.createPdf(C07E03_UnitedStates_PDFA_3a.java:158)
at pdfatest.C07E03_UnitedStates_PDFA_3a.main(C07E03_UnitedStates_PDFA_3a.java:40)
This is the modified example code:
public void createPdf(String dest) throws IOException {
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_3A,
new PdfOutputIntent("Custom", "", "http://www.color.org",
"sRGB IEC61966-2.1", new FileInputStream(INTENT)));
Document document = new Document(pdf, PageSize.A4.rotate());
document.setMargins(20, 20, 20, 20);
//Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/A-3 example");
//Add attachment
PdfDictionary parameters = new PdfDictionary();
parameters.put(PdfName.ModDate, new PdfDate().getPdfObject());
PdfFileSpec fileSpec = PdfFileSpec.createEmbeddedFileSpec(
pdf, Files.readAllBytes(Paths.get(DATA)), "united_states.csv",
"united_states.csv", new PdfName("text/csv"), parameters,
PdfName.Data, false);
fileSpec.put(new PdfName("AFRelationship"), new PdfName("Data"));
pdf.addFileAttachment("united_states.csv", fileSpec);
PdfArray array = new PdfArray();
array.add(fileSpec.getPdfObject().getIndirectReference());
pdf.getCatalog().put(new PdfName("AF"), array);
//Embed fonts
PdfFont font = PdfFontFactory.createFont(FONT, true);
PdfFont bold = PdfFontFactory.createFont(BOLD_FONT, true);
// Create content
Table table = new Table(new float[]{4, 1, 3, 4, 3, 3, 3, 3, 1});
table.setWidthPercent(100);
BufferedReader br = new BufferedReader(new FileReader(DATA));
String line = br.readLine();
process(table, line, bold, true);
while ((line = br.readLine()) != null) {
process(table, line, font, false);
}
br.close();
document.add(table);
// START additional code to add a form field
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
PdfFormField textFormField = PdfFormField.createText(
pdf,
new Rectangle(50, 50, 200, 15),
"vo-1-text", "bla", font, 12.0f);
form.addField(textFormField);
// END additional code to add a form field
//Close document
document.close();
}
Am I missing something?

If you can switch to iText 7.1.x, then it's relatively easy: there is a new overload for createText which allows you to set a PdfAConformanceLevel parameter, so
PdfFormField textFormField = PdfFormField.createText(
pdf,
new Rectangle(50, 50, 200, 15),
"vo-1-text", "bla", font, 12.0f,
false, // multiline parameter
PdfAConformanceLevel.PDF_A_3A // <-- this is the important one
);
If you can't, or won't, switch to iText 7.1, then you can try manually triggering the code that will set the flag:
for (PdfWidgetAnnotation wid : textFormField.getWidgets()) {
wid.setFlag(PdfAnnotation.PRINT);
}
I'm not sure if this last approach will work, because you may need to have explicitly defined the PdfWidgetAnnotation, but it is much easier to go to iText 7.1.
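The exception concerns the annotation's F entry, which is an integer bit field. The following is a minimal sketch of the bit arithmetic only, with no iText dependency. The flag values follow the annotation flags table of the PDF specification; the rule encoded in acceptableForPdfA (Print set; Invisible, Hidden, NoView and ToggleNoView clear) is my reading of the PDF/A-2 annotation requirements and should be checked against the standard. The class and method names are made up for illustration.

```java
class AnnotationFlagsDemo {
    // Bit values of the annotation F entry (PDF specification, annotation flags).
    static final int INVISIBLE = 1;
    static final int HIDDEN = 2;
    static final int PRINT = 4;
    static final int NO_VIEW = 32;
    static final int TOGGLE_NO_VIEW = 256;

    // True if Print is set and none of the visibility-suppressing flags are set.
    // Assumption: this mirrors the PDF/A-2 rule for annotation flags.
    static boolean acceptableForPdfA(int flags) {
        int forbidden = INVISIBLE | HIDDEN | NO_VIEW | TOGGLE_NO_VIEW;
        return (flags & PRINT) != 0 && (flags & forbidden) == 0;
    }

    // Sets the Print bit while leaving all other bits untouched.
    static int withPrint(int flags) {
        return flags | PRINT;
    }

    public static void main(String[] args) {
        System.out.println(acceptableForPdfA(0));                  // false
        System.out.println(acceptableForPdfA(withPrint(0)));       // true
        System.out.println(acceptableForPdfA(withPrint(HIDDEN)));  // false
    }
}
```

A widget without any F entry fails the check in the stack trace above before the individual bits are even considered, which is why setting the Print flag on each widget is suggested as the workaround.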

Is it possible to mark a string in pdf document?

I was wondering whether it is possible to mark strings in a PDF with a different color, or underline them, while looping through the PDF document.

It's possible when creating a document. Just use different chunks to set the style. Here's an example:
Document document = new Document();
PdfWriter.getInstance(document, outputStream);
document.open();
document.add(new Chunk("This word is "));
Chunk underlined = new Chunk("underlined");
underlined.setUnderline(1.0f, -1.0f); //We can customize thickness and position of underline
document.add(underlined);
document.add(new Chunk(". And this phrase has "));
Chunk background = new Chunk("yellow background.");
background.setBackground(BaseColor.YELLOW);
document.add(background);
document.add(Chunk.NEWLINE);
document.close();
However, it's almost impossible to edit an existing PDF document. The author of iText writes in his book:
In a PDF document, every character or glyph on a PDF page has its
fixed position, regardless of the application that’s used to view the
document. This is an advantage, but it also comes with a disadvantage.
Suppose you want to replace the word “edit” with the word “manipulate”
in a sentence, you’d have to reflow the text. You’d have to reposition
all the characters that follow that word. Maybe you’d even have to
move a portion of the text to the next page. That’s not trivial, if
not impossible.
If you want to “edit” a PDF, it’s advised that you change the original
source of the document and remake the PDF.

Aspose.PDF APIs support creating new PDF documents and manipulating existing PDF documents without any dependency on Adobe Acrobat. You can search for text and add a Highlight Annotation to mark it in the PDF.
REST API Solution using Aspose.PDF Cloud SDK for Java:
// For complete examples and data files, please go to https://github.com/aspose-pdf-cloud/aspose-pdf-cloud-java
String name = "02_pages.pdf";
String folder="Temp";
String remotePath=folder+"/"+name;
// File to upload
File file = new File("C:/Temp/"+name);
// Storage name is default storage
String storage = null;
// Get App Key and App SID from https://dashboard.aspose.cloud/
PdfApi pdfApi = new PdfApi("xxxxxxxxxxxxxxxxxxxx", "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxx");
//Upload file to cloud storage
pdfApi.uploadFile(remotePath,file,storage);
//Text Position
Rectangle rect = new Rectangle().LLX(259.27580539703365).LLY(743.4707997894287).URX(332.26148873138425).URY(765.5148007965088);
List<AnnotationFlags> flags = new ArrayList<>();
flags.add(AnnotationFlags.DEFAULT);
HighlightAnnotation annotation = new HighlightAnnotation();
annotation.setName("Name Updated");
annotation.rect(rect);
annotation.setFlags(flags);
annotation.setHorizontalAlignment(HorizontalAlignment.CENTER);
annotation.setRichText("Rich Text Updated");
annotation.setSubject("Subj Updated");
annotation.setPageIndex(1);
annotation.setZindex(1);
annotation.setTitle("Title Updated");
annotation.setModified("02/02/2018 00:00:00.000 AM");
List<HighlightAnnotation> annotations = new ArrayList<>();
annotations.add(annotation);
//Add Highlight Annotation to the PDF document
AsposeResponse response = pdfApi.postPageHighlightAnnotations(name,1, annotations, storage, folder);
//Download annotated PDF file from Cloud Storage
File downloadResponse = pdfApi.downloadFile(remotePath, null, null);
File dest = new File("C:/Temp/HighlightAnnotation.pdf");
Files.copy(downloadResponse.toPath(), dest.toPath(), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
System.out.println("Completed......");
On-Premise Solution using Aspose.PDF for Java:
// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.Pdf-for-Java
// Instantiate Document object
Document document = new Document("C:/Temp/Test.pdf");
// Create TextFragment Absorber instance to search particular text fragment
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Estoque");
// Iterate through pages of PDF document
for (int i = 1; i <= document.getPages().size(); i++) {
// Get the current page (page numbers are 1-based) and search it
Page page = document.getPages().get_Item(i);
page.accept(textFragmentAbsorber);
}
// Create a collection of Absorbed text
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// Iterate on above collection
for (int j = 1; j <= textFragmentCollection.size(); j++) {
TextFragment textFragment = textFragmentCollection.get_Item(j);
// Get rectangular dimensions of TextFragment object
Rectangle rect = new Rectangle((float) textFragment.getPosition().getXIndent(), (float) textFragment.getPosition().getYIndent(), (float) textFragment.getPosition().getXIndent() + (float) textFragment.getRectangle().getWidth(), (float) textFragment.getPosition().getYIndent() + (float) textFragment.getRectangle().getHeight());
// Instantiate HighLight Annotation instance
HighlightAnnotation highLight = new HighlightAnnotation(textFragment.getPage(), rect);
// Set opacity for annotation
highLight.setOpacity(.80);
// Set the color of annotation
highLight.setColor(Color.getYellow());
// Add annotation to annotations collection of TextFragment
textFragment.getPage().getAnnotations().add(highLight);
}
// Save updated document
document.save("C:/Temp/HighLight.pdf");
P.S: I work as support/evangelist developer at Aspose.
