Extract footer data of PDF in java - java

I am able to extract the text of PDF pages into a string.
However, the footer text is extracted along with it. I want to remove the footer text from all pages of the PDF. How can I do that?
I tried a Rectangle2D region, but the coordinates I used do not return any text.

In a comment the OP indicated that he used this code:
PDDocument doc = PDDocument.load("xyz.pdf");
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get( 1 );
Rectangle2D region = new Rectangle2D.Double(10, 10, 10, 10);
String regionName = "region";
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
System.out.println("Region is "+ stripper.getTextForRegion("region"));
For most documents this code will extract no text because it looks at a small (10x10 pt) region in the upper left region of the second document page. Thus, the values in new Rectangle2D.Double(10, 10, 10, 10) have to change.
I tried various regions, yet I am not getting any text. If you have an idea for a normal PDF page, please share it.
There is no such thing as a normal PDF page. The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created or the environment in which they are viewed or printed. There is no serious restriction on page dimensions or on the location of content on pages.
E.g. for this form
you need values like these
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
Rectangle2D region = new Rectangle2D.Float(0f, 230f, 612f, 300f);
to extract the body "I authorize any health plan ... I have received a copy of this authorization." without headers, footers, or form lines.
If you have many similar pages (e.g. one large document with many pages that share a similar layout), you have to measure only once and can reuse the region for all of those pages.
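The body region used above can also be derived from the page size and the heights of the header and footer bands instead of being found by trial and error. The following is a minimal sketch that uses only java.awt.geom. The class and method names are hypothetical, and the US Letter page size (612 x 792 pt) and the 262 pt footer height are assumptions, chosen so that the result matches the region quoted above.

```java
import java.awt.geom.Rectangle2D;

class BodyRegion {

    /**
     * Builds the region between a header band and a footer band.
     * Assumes the convention used by the regions above: origin in the
     * upper left corner, y growing downward, all values in points.
     */
    static Rectangle2D bodyRegion(double pageWidth, double pageHeight,
                                  double headerHeight, double footerHeight) {
        double bodyHeight = pageHeight - headerHeight - footerHeight;
        if (bodyHeight <= 0) {
            throw new IllegalArgumentException("Header and footer cover the whole page");
        }
        // The body starts right below the header and spans the full page width.
        return new Rectangle2D.Double(0, headerHeight, pageWidth, bodyHeight);
    }

    public static void main(String[] args) {
        // Assumed values: US Letter page, 230 pt header band, 262 pt footer band.
        Rectangle2D region = bodyRegion(612, 792, 230, 262);
        System.out.println(region);
    }
}
```

The resulting rectangle can be passed to stripper.addRegion(...) in the same way as the hard-coded one.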

Related

Add HTML Markup using java Apache PDFBOX

I have been using PDFBox and EasyTable, which extends PDFBox to draw data tables. I have hit a problem: I have a Java object holding a string of HTML that I need to add to the PDF using PDFBox. Digging through the documentation has not borne any fruit.
The code below is a hello-world snippet; I want the text in the generated PDF to have H1 formatting.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "<h1>HelloWorld</h1>" );
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
Use Jericho to render the HTML as plain text while mapping the output of the tags correctly.
Sample:
public String extractAllText(String htmlText){
return new net.htmlparser.jericho
.Source(htmlText)
.getRenderer()
.setMaxLineLength(Integer.MAX_VALUE)
.setNewLine(null)
.toString();
}
Add the dependency to your Gradle build (or the Maven equivalent):
compile group: 'net.htmlparser.jericho', name: 'jericho-html', version: '3.4'
PDFBox does not know HTML, at least not for creating content.
Thus, with plain PDFBox you have to parse the HTML yourself and derive the text drawing characteristics from the tags the text is in.
E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the information that it is in an h1 tag to select an appropriate top-level heading font and font size to draw that "HelloWorld".
Alternatively you can look for a library doing that HTML parsing and transforming to PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
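To make the do-it-yourself route more concrete, here is a minimal sketch of the "extract the text and derive the font size from the tag" step. It deliberately does not touch PDFBox. The class name, the StyledText holder, and the font sizes per heading level are all assumptions made for illustration, and the regular expression only handles a single, non-nested h1 to h6 element, not general HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingStyle {

    /** Plain text plus the font size it should be drawn with. */
    static final class StyledText {
        final String text;
        final float fontSize;

        StyledText(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Matches exactly one heading element such as <h1>...</h1>.
    private static final Pattern HEADING = Pattern.compile(
            "<h([1-6])>(.*?)</h\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Illustrative sizes only; pick whatever suits the document. */
    static float fontSizeFor(int level) {
        switch (level) {
            case 1: return 24f;
            case 2: return 18f;
            case 3: return 14f;
            default: return 12f;
        }
    }

    /** Strips a surrounding heading tag and derives a font size from it. */
    static StyledText parse(String html) {
        Matcher m = HEADING.matcher(html.trim());
        if (m.matches()) {
            int level = Integer.parseInt(m.group(1));
            return new StyledText(m.group(2), fontSizeFor(level));
        }
        // No heading found: keep the input and use the body size.
        return new StyledText(html, 12f);
    }

    public static void main(String[] args) {
        StyledText styled = parse("<h1>HelloWorld</h1>");
        System.out.println(styled.text + " @ " + styled.fontSize + " pt");
    }
}
```

In the code from the question, the result would replace the hard-coded values: contentStream.setFont(font, styled.fontSize) followed by contentStream.drawString(styled.text).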

Adding a page with PDFBox doesn't work

I'm trying to add a page to an existing PDF-Document that I'm performing multiple different actions on before and after the page is supposed to be added.
Currently I open the document at the beginning and write text on its first and second pages. On the second page I add some images as well. What is written differs per PDF, and sometimes there is so much of it that two pages (or sometimes even three) aren't enough. Now I'm trying to add a third or even fourth page once a certain amount of text and images is on the second page.
Somehow no matter what I do, the third page I want to add doesn't show up in the final document. Here's my code to add the page:
if(doc.getNumberOfPages() < p+1){
PDDocument emptyDoc = PDDocument.load("./data/EmptyPage.pdf");
PDPage emptyPage = (PDPage)emptyDoc.getDocumentCatalog().getAllPages().get(0);
doc.addPage(emptyPage);
emptyDoc.close();
}
When I check doc.getNumberOfPages() before, it says 2. Afterwards it says 3. The final document still just has 2 pages. The code after the if-clause contains multiple contentStreams that are supposed to write text on the new page (and on the first and second page).
contentStream = new PDPageContentStream(doc, (PDPage) allPages.get((int)p), true, true);
In the end, I save the document via
doc.save(tarFolder+nr.get(i)+".pdf");
doc.close();
I've created a whole new project with a class that's supposed to do the exact same thing - add a page to another PDF. This code works perfectly fine and the third page shows up - so it seems like I'm just missing something. My code works perfectly fine for page 1 + 2, we just had the new case that we need a third/fourth page sometimes lately, so I want to integrate that into my main project.
Here's the new project that works:
PDDocument doc = PDDocument.load("D:\\test.pdf");
PDDocument doc2 = PDDocument.load("D:\\EmptyPage.pdf");
List<PDPage> allPages = doc2.getDocumentCatalog().getAllPages();
PDPage page = (PDPage) allPages.get(pageNumber);
doc.addPage(page);
doc.save("D:\\testoutput.pdf");
What's weird in my main project is that the third page I add gets counted by getNumberOfPages() but doesn't show up in the final product. The program throws an error if I don't add the page because it tries to write content on the third page.
Any idea what I'm doing wrong?
Thanks in advance!
Edit:
If I add the page at the beginning, when my document is loaded the first time, the page gets added and exists in my final document - like this:
doc = PDDocument.load(config.getFolder("template"));
PDDocument emptyDoc = PDDocument.load("./data/EmptyPage.pdf");
PDPage emptyPage = (PDPage)emptyDoc.getDocumentCatalog().getAllPages().get(0);
doc.addPage(emptyPage);
However, since some documents don't need that extra page, it gets unnecessarily complicated - and I feel like removing the page if it isn't needed isn't really pretty, since I'd like to avoid adding it in the first place. Maybe somebody has an idea now?
I found an answer thanks to Tilman Hausherr.
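If I move the call emptyDoc.close() to the end of my code, right after:
doc.save(tarFolder+nr.get(i)+".pdf");
doc.close();
the page shows up in the final document without any issues. The added page still depends on data held by emptyDoc, so that document has to stay open until doc has been saved.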

iText PDF add text in absolute position on top of the 1st page

I have a script that creates a PDF file and writes contents to it. After the execution is complete I need to write the status (fail, success) to the PDF, but the status should be on the top of the page. So the solution I came up with is to use absolute positioned text. Below is my code for the same
PdfContentByte cb = writer.DirectContent;
BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
cb.SaveState();
cb.BeginText();
cb.MoveText(700, 30);
cb.SetFontAndSize(bf, 12);
cb.ShowText("My status");
cb.EndText();
cb.RestoreState();
But as the PDF has multiple pages, this text is added to the last page of the PDF. How can I add it to the first page?
Also, is there a way to calculate the top coordinates of the page, i.e. the top-left coordinate?
iText was written with internet applications in mind. It was designed to flush content from memory as soon as possible: if a page is finished, that page is sent to the OutputStream and there is no way to return to that page.
That doesn't mean your requirement is impossible. PDF has a concept known as Form XObject. In iText, this concept is implemented under the name PdfTemplate. Such a PdfTemplate is a rectangular canvas with a fixed size that can be added to a page without being part of that page.
An example should clarify what that means. Please take a look at the WriteOnFirstPage example. In this example, we create a PdfTemplate like this:
PdfContentByte cb = writer.getDirectContent();
PdfTemplate message = cb.createTemplate(523, 50);
This message object refers to a Form XObject. It is a piece of content that is external to the page content.
We wrap the PdfTemplate inside an Image object. By doing so, we can add the Form XObject to the document just like any other object:
Image header = Image.getInstance(message);
document.add(header);
Now we can add as much data as we want:
for (int i = 0; i < 100; i++) {
document.add(new Paragraph("test"));
}
Adding 100 "test" lines will cause iText to create 3 pages. Once we're on page 3, we no longer have access to page 1, but we can still write content to the message object:
ColumnText ct = new ColumnText(message);
ct.setSimpleColumn(new Rectangle(0, 0, 523, 50));
ct.addElement(
new Paragraph(
String.format("There are %s pages in this document", writer.getPageNumber())));
ct.go();
If you check the resulting PDF write_on_first_page.pdf, you'll notice that the text we've added last is indeed on the first page.
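The width of 523 used for the template and the question about the top-left coordinate both come down to simple page arithmetic. The following is a minimal sketch, assuming an A4 page (595 x 842 pt) with 36 pt margins on every side; those values are assumptions about the document setup, and the class and method names are hypothetical. PDF user space has its origin in the lower left corner, so the top edge of the page lies at y = page height.

```java
class PageGeometry {

    // Assumed page setup: A4 in points with 36 pt margins on all sides.
    static final float A4_WIDTH = 595f;
    static final float A4_HEIGHT = 842f;
    static final float MARGIN = 36f;

    /** Width left for content between the left and right margins. */
    static float usableWidth(float pageWidth, float leftMargin, float rightMargin) {
        return pageWidth - leftMargin - rightMargin;
    }

    /**
     * y coordinate of the top of the content area. The origin is in the
     * lower left corner, so the top is the page height minus the margin.
     */
    static float topY(float pageHeight, float topMargin) {
        return pageHeight - topMargin;
    }

    public static void main(String[] args) {
        float width = usableWidth(A4_WIDTH, MARGIN, MARGIN);
        float top = topY(A4_HEIGHT, MARGIN);
        // Top-left corner of the content area is (left margin, top).
        System.out.println("usable width = " + width + ", top-left = (" + MARGIN + ", " + top + ")");
    }
}
```

Under these assumptions the usable width works out to the 523 pt passed to createTemplate, and text placed at absolute coordinates near the top of the page needs a y value close to the page height, not close to 0.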

PDFbox, preview List<PDPage> elements

With the following code I open a PDF file (chosen with a file chooser; I don't show all the code here because it's not important) and then put all the PDF pages into a List of PDPage. If the PDF has 3 pages I will have 3 images, if the PDF has 4 pages I will have 4 images, and so on.
PDDocument document2 = PDDocument.loadNonSeq(new File(pdfFile), null);
List<PDPage> pdPages = document2.getDocumentCatalog().getAllPages();
To convert the pages to BufferedImage I do this:
for (PDPage pdPage : pdPages)
{
++page;
BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
ImageIOUtil.writeImage(bim, pdfFile + "-" + page + ".png", 300);
}
and then I show, for example, one page in a StackPane. Now the problem: I want to show a preview of pdPages (as PDPage objects, not as images) because I think that is much better for big PDFs. If I convert a PDF with 200 pages, it takes a long time to complete the task. I thought of showing a preview of only some pages (at most 10 or 15, the exact number is not important) so that the user can click the page of interest, but in PDPage format, not yet converted to an image. Am I thinking along the right lines, or am I wrong? If I am right, what can I do?
Thanks in advance
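One possible direction, sketched under assumptions and not as a confirmed solution: render only the first few pages as thumbnails and convert the remaining pages on demand. The helper below merely limits a page list to a preview size. Its names are hypothetical, and it works on any List, so it does not depend on PDFBox.

```java
import java.util.Arrays;
import java.util.List;

class PagePreview {

    /**
     * Returns a view of at most maxPreview leading elements of pages.
     * The result is backed by the original list (see List.subList).
     */
    static <T> List<T> firstPages(List<T> pages, int maxPreview) {
        if (maxPreview < 0) {
            throw new IllegalArgumentException("maxPreview must not be negative");
        }
        return pages.subList(0, Math.min(maxPreview, pages.size()));
    }

    public static void main(String[] args) {
        // Stand-in for a list of pages; strings are used only for illustration.
        List<String> pages = Arrays.asList("p1", "p2", "p3", "p4", "p5");
        for (String page : firstPages(pages, 3)) {
            System.out.println("would render thumbnail for " + page);
        }
    }
}
```

With the code from the question, the loop would iterate over firstPages(pdPages, 10) instead of pdPages, ideally with a resolution well below 300 for the thumbnails, since the rendering step is what takes the time.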

Resize a pdf using pdfbox

I am attempting to view a pdf through a JFrame that contains a PDFPagePanel. While I can view the PDF, it is smaller than the actual page and therefore it appears fuzzy at best. I need the document in the full resolution and size that it is. I do NOT want to turn this into an image. I need it to stay a PDF onscreen. Here is the code I am using:
protected PDDocument originalDoc;
protected PDPage memDoc;
protected PDFPagePanel pdfPanel; // initialized in constructor
....
List<PDPage> memPDF = originalDoc.getDocumentCatalog().getAllPages();
memDoc = memPDF.get(0);
pdfPanel.setPage(memDoc);
add(pdfPanel);
setBounds(1, 1, pdfPanel.getWidth(), pdfPanel.getHeight() + 51); // set JFrame to fit the size of the document + dropdown menu
pdfPanel.setVisible(true);
I get the PDF onscreen. It's not the correct size, though. The page itself is 8.5" x 11" and when I view it in Adobe PDF the "image" is fine and is very readable. In my panel it is very obviously not scaled correctly on the screen, and I cannot find out why. This is not the same as converting the page to a BufferedImage, which is the solution I see online. I need it to stay in PDPage format and display that in the PDFPagePanel class.
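A likely cause, stated as an assumption and not as a confirmed diagnosis: PDF page dimensions are given in points (1/72 inch), so an 8.5" x 11" page measures 612 x 792 units, and a panel that draws one pixel per point shows the page smaller than its physical size on any screen with more than 72 pixels per inch. The sketch below only performs the unit conversion; the class and method names are hypothetical, and 96 DPI is just an example value.

```java
class PointsToPixels {

    /** PDF user space units per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Scale factor needed to show a page at its physical size. */
    static double scaleFor(double screenDpi) {
        return screenDpi / POINTS_PER_INCH;
    }

    /** Converts a length in points to whole pixels at the given resolution. */
    static long toPixels(double points, double screenDpi) {
        return Math.round(points * screenDpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        double dpi = 96.0; // example value; a real program would query the screen
        long width = toPixels(8.5 * POINTS_PER_INCH, dpi);
        long height = toPixels(11 * POINTS_PER_INCH, dpi);
        System.out.println("scale " + scaleFor(dpi) + ", target size " + width + " x " + height + " px");
    }
}
```

The screen resolution could be obtained from java.awt.Toolkit.getDefaultToolkit().getScreenResolution(). Whether PDFPagePanel can be told to draw with such a scale factor is not something the question establishes.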
