PDFBox, preview List<PDPage> elements - java

With the following code I open a PDF file (chosen through a file chooser; I omit that code because it is not important here) and put all of its pages in a List of PDPage. If the PDF has 3 pages I will get 3 images, if it has 4 pages I will get 4 images, and so on.
PDDocument document2 = PDDocument.loadNonSeq(new File(pdfFile), null);
List<PDPage> pdPages = document2.getDocumentCatalog().getAllPages();
To convert each page to a BufferedImage I do this:
int page = 0;
for (PDPage pdPage : pdPages)
{
    ++page;
    // Render the page at 300 DPI and write it out as a PNG
    BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
    ImageIOUtil.writeImage(bim, pdfFile + "-" + page + ".png", 300);
}
I then show, for example, one page in a StackPane. Now the problem: I want to show a preview of the pdPages (that is, the PDPage objects themselves rather than converted images) because I think that is much better for big PDFs. Converting a PDF with 200 pages takes a long time. My idea is to show a preview of a limited number of pages (at most 10 or 15; the exact number is not important) so the user can click the page of interest, but with the pages still in PDPage form, not yet converted to images. Am I thinking about this the right way, or am I wrong? If it is feasible, how can I do it?
Thanks in advance
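One possible direction, sketched under assumptions rather than taken from this thread: limit the work by picking only the first few pages and rendering them at a much lower resolution than 300 DPI. The class and method names below (PreviewPlan, firstPages, pixelsAt) are hypothetical, and the generic list merely stands in for the List<PDPage> from the question. The arithmetic relies on one PDF point being 1/72 inch.

```java
import java.util.Arrays;
import java.util.List;

class PreviewPlan {
    // PDF user space uses 72 points per inch by default.
    static final int POINTS_PER_INCH = 72;

    /** Returns at most maxPreview leading elements of the page list (a view, not a copy). */
    static <T> List<T> firstPages(List<T> pages, int maxPreview) {
        int count = Math.max(0, Math.min(maxPreview, pages.size()));
        return pages.subList(0, count);
    }

    /** Converts a length in points to pixels at the given resolution. */
    static int pixelsAt(float points, int dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // Strings stand in for PDPage objects in this sketch.
        List<String> pages = Arrays.asList("p1", "p2", "p3");
        System.out.println(firstPages(pages, 10).size() + " pages to preview");
        System.out.println("Letter width at 300 dpi: " + pixelsAt(612f, 300) + " px");
        System.out.println("Letter width at 36 dpi: " + pixelsAt(612f, 36) + " px");
    }
}
```

Each page returned by firstPages would still go through the convertToImage call shown in the question, only with a smaller DPI argument, so a thumbnail has far fewer pixels to produce than a full 300 DPI render.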

Related

Add HTML Markup using java Apache PDFBOX

I have been using PDFBox and EasyTable, which extends PDFBox to draw data tables. I have hit a problem: I have a Java object holding a string of HTML that I need to add to the PDF using PDFBox. Digging through the documentation has not turned up anything useful.
The code below is a hello-world snippet; in the generated PDF I want the text to appear with H1 formatting.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "<h1>HelloWorld</h1>" );
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
Use Jericho to convert the HTML to plain text while rendering the output of the tags correctly.
Sample:
public String extractAllText(String htmlText){
    // Render the HTML as plain text without wrapping lines
    return new net.htmlparser.jericho
            .Source(htmlText)
            .getRenderer()
            .setMaxLineLength(Integer.MAX_VALUE)
            .setNewLine(null)
            .toString();
}
Add the dependency to your Gradle build (or the equivalent in Maven):
compile group: 'net.htmlparser.jericho', name: 'jericho-html', version: '3.4'
PDFBox does not know HTML, at least not for creating content.
Thus, with plain PDFBox you have to parse the HTML yourself and derive special text drawing characteristics from the tags text is in.
E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the information that it is in an h1 tag to select an appropriate top-level heading font and font size to draw that "HelloWorld".
Alternatively you can look for a library doing that HTML parsing and transforming to PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
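The do-it-yourself route described above could look roughly like the following minimal sketch. It only recognizes h1 to h6 tags with a regular expression and treats everything else as body text; the class name HeadingRuns, the Run type, and the font sizes are all hypothetical choices for illustration, and real HTML would need a proper parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingRuns {
    /** One piece of text plus the font size it should be drawn with. */
    static final class Run {
        final String text;
        final float fontSize;
        Run(String text, float fontSize) { this.text = text; this.fontSize = fontSize; }
    }

    // Matches <h1>...</h1> through <h6>...</h6>; attributes on the tag are not supported.
    private static final Pattern HEADING =
            Pattern.compile("<h([1-6])>(.*?)</h\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical sizes in points for h1..h6; body text uses 12.
    private static final float[] HEADING_SIZES = {24f, 20f, 16f, 14f, 12f, 10f};
    private static final float BODY_SIZE = 12f;

    /** Splits the HTML into heading runs and body runs, in document order. */
    static List<Run> parse(String html) {
        List<Run> runs = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        int last = 0;
        while (m.find()) {
            addBody(runs, html.substring(last, m.start()));
            int level = Integer.parseInt(m.group(1));
            runs.add(new Run(m.group(2).trim(), HEADING_SIZES[level - 1]));
            last = m.end();
        }
        addBody(runs, html.substring(last));
        return runs;
    }

    // Strips any remaining tags and keeps the text if something is left.
    private static void addBody(List<Run> runs, String raw) {
        String text = raw.replaceAll("<[^>]+>", "").trim();
        if (!text.isEmpty()) {
            runs.add(new Run(text, BODY_SIZE));
        }
    }

    public static void main(String[] args) {
        for (Run run : parse("<h1>HelloWorld</h1><p>Some body text</p>")) {
            System.out.println(run.fontSize + " pt: " + run.text);
        }
    }
}
```

Each Run would then be drawn with the calls from the question, i.e. contentStream.setFont(font, run.fontSize) followed by contentStream.drawString(run.text), moving the text position down between runs.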

iText PDF add text in absolute position on top of the 1st page

I have a script that creates a PDF file and writes content to it. After execution completes I need to write the status (fail or success) to the PDF, and the status should appear at the top of the page. The solution I came up with is to use absolutely positioned text. Below is my code:
PdfContentByte cb = writer.DirectContent;
BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
cb.SaveState();
cb.BeginText();
cb.MoveText(700, 30);
cb.SetFontAndSize(bf, 12);
cb.ShowText("My status");
cb.EndText();
cb.RestoreState();
But because the PDF spans multiple pages, the text is added to the last page of the PDF. How can I add it to the first page?
Also, is there a way to calculate the coordinates of the top of the page, i.e. the top-left corner?
iText was written with internet applications in mind. It was designed to flush content from memory as soon as possible: if a page is finished, that page is sent to the OutputStream and there is no way to return to that page.
That doesn't mean your requirement is impossible. PDF has a concept known as Form XObject. In iText, this concept is implemented under the name PdfTemplate. Such a PdfTemplate is a rectangular canvas with a fixed size that can be added to a page without being part of that page.
An example should clarify what that means. Please take a look at the WriteOnFirstPage example. In this example, we create a PdfTemplate like this:
PdfContentByte cb = writer.getDirectContent();
PdfTemplate message = cb.createTemplate(523, 50);
This message object refers to a Form XObject. It is a piece of content that is external to the page content.
We wrap the PdfTemplate inside an Image object. By doing so, we can add the Form XObject to the document just like any other object:
Image header = Image.getInstance(message);
document.add(header);
Now we can add as much data as we want:
for (int i = 0; i < 100; i++) {
    document.add(new Paragraph("test"));
}
Adding 100 "test" lines will cause iText to create 3 pages. Once we're on page 3, we no longer have access to page 1, but we can still write content to the message object:
ColumnText ct = new ColumnText(message);
ct.setSimpleColumn(new Rectangle(0, 0, 523, 50));
ct.addElement(
    new Paragraph(
        String.format("There are %s pages in this document", writer.getPageNumber())));
ct.go();
If you check the resulting PDF write_on_first_page.pdf, you'll notice that the text we've added last is indeed on the first page.
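The question also asked about the top-left coordinate, which the answer above does not cover. As a hedged sketch: in default PDF user space the origin is the lower-left corner of the page and y grows upward, so the top-left corner is simply (left edge, top edge). The class PageCorners and its methods are hypothetical helpers, and the A4 size of 595 x 842 points is an assumption for the usage example.

```java
class PageCorners {
    /**
     * Returns {x, y} of the top-left corner of a page box described by its
     * lower-left (llx, lly) and upper-right (urx, ury) corners, in points.
     */
    static float[] topLeft(float llx, float lly, float urx, float ury) {
        return new float[] {llx, ury};
    }

    /** Returns the y coordinate that lies the given distance below the top edge. */
    static float yFromTop(float ury, float distanceFromTop) {
        return ury - distanceFromTop;
    }

    public static void main(String[] args) {
        // Assumed A4 page: 595 x 842 points with the origin at the lower-left corner.
        float[] corner = topLeft(0f, 0f, 595f, 842f);
        System.out.println("top-left = (" + corner[0] + ", " + corner[1] + ")");
        System.out.println("y for text 30 pt below the top = " + yFromTop(842f, 30f));
    }
}
```

Seen this way, the question's cb.MoveText(700, 30) puts the text 30 points above the bottom edge rather than near the top; a y value computed like yFromTop would be needed for a status line at the top of the page.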

Resize a pdf using pdfbox

I am attempting to view a pdf through a JFrame that contains a PDFPagePanel. While I can view the PDF, it is smaller than the actual page and therefore it appears fuzzy at best. I need the document in the full resolution and size that it is. I do NOT want to turn this into an image. I need it to stay a PDF onscreen. Here is the code I am using:
protected PDDocument originalDoc;
protected PDPage memDoc;
protected PDFPagePanel pdfPanel; // initialized in constructor
....
List<PDPage> memPDF = originalDoc.getDocumentCatalog().getAllPages();
memDoc = memPDF.get(0);
pdfPanel.setPage(memDoc);
add(pdfPanel);
setBounds(1, 1, pdfPanel.getWidth(), pdfPanel.getHeight() + 51); // set JFrame to fit the size of the document + dropdown menu
pdfPanel.setVisible(true);
I get the PDF onscreen. It's not the correct size, though. The page itself is 8.5" x 11", and when I view it in Adobe's PDF viewer the "image" is fine and very readable. In my panel it is very obviously not scaled correctly, and I cannot find any explanation as to why. This is not the same as converting the page to a BufferedImage, which is what the solutions I see online do. I need it to stay in PDPage format and display that in the PDFPagePanel class.

Extract footer data of PDF in java

I am able to extract the text of PDF pages into a string.
But the footer text is extracted along with it. I want to remove the footer from all pages of the PDF. How can I do that?
I tried using a Rectangle2D region, but with my coordinates no text is returned.
In a comment the OP indicated that he used this code:
PDDocument doc = PDDocument.load("xyz.pdf");
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get( 1 );
Rectangle2D region = new Rectangle2D.Double(10, 10, 10, 10);
String regionName = "region";
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
System.out.println("Region is "+ stripper.getTextForRegion("region"));
For most documents this code will extract no text because it looks at a small (10 x 10 pt) region near the upper-left corner of the second page of the document. Thus, the values in new Rectangle2D.Double(10, 10, 10, 10) have to change.
I tried various regions, yet I am not getting any text. If you have an idea for a normal PDF page, you should share it.
There is no such thing as a normal PDF page. The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created or the environment in which they are viewed or printed. There is no serious restriction on page dimensions or location of content on pages.
E.g. for this form
you need values like these
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
Rectangle2D region = new Rectangle2D.Float(0f, 230f, 612f, 300f);
to extract the body "I authorize any health plan ... I have received a copy of this authorization." without headers, footers, or form lines.
If you have many similar pages (e.g. one large document with many pages that share a similar layout), you only have to measure once and can reuse the same region for all pages you extract.
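The measuring step can be reduced to a small calculation once the header and footer heights are known. The sketch below is an illustration under assumptions: BodyRegion and bodyRegion are hypothetical names, the region is measured from the top-left corner of the page with y growing downward (consistent with the 10 x 10 region above landing in the upper-left corner), and the footer height of 262 pt is chosen only so that the result matches the values given for the example form on a 612 x 792 pt page.

```java
import java.awt.geom.Rectangle2D;

class BodyRegion {
    /**
     * Builds a region covering the full page width between a header band at
     * the top and a footer band at the bottom. All values are in points.
     */
    static Rectangle2D bodyRegion(float pageWidth, float pageHeight,
                                  float headerHeight, float footerHeight) {
        float bodyHeight = pageHeight - headerHeight - footerHeight;
        if (bodyHeight <= 0f) {
            throw new IllegalArgumentException("header and footer leave no room for a body");
        }
        // x = 0 (left edge), y = headerHeight (just below the header band)
        return new Rectangle2D.Float(0f, headerHeight, pageWidth, bodyHeight);
    }

    public static void main(String[] args) {
        // Letter page; 230 pt header as in the example above, 262 pt footer is an assumed value.
        System.out.println(bodyRegion(612f, 792f, 230f, 262f));
    }
}
```

The resulting rectangle can be passed to stripper.addRegion(regionName, region) exactly as in the code quoted from the OP, and the same rectangle reused for every page with the same layout.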

PDFBox change page sizes and save it again

First of all, sorry for my bad English.
I'm trying to remove the header and footer of a PDF page. I need to search for some words across the page break, but that is impossible with the header and footer in the way, so I need to crop them and then convert the page to text; after that it becomes possible to search for the words.
This is what I'm doing:
PDDocument pdDoc = PDDocument.load("document.pdf");
PDPage page = (PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0);
PDRectangle rectangle = new PDRectangle();
rectangle.setUpperRightY(page.findCropBox().getUpperRightY() - 100);
rectangle.setLowerLeftY(page.findCropBox().getLowerLeftY());
rectangle.setUpperRightX(page.findCropBox().getUpperRightX());
rectangle.setLowerLeftX(page.findCropBox().getLowerLeftX());
page.setMediaBox(rectangle);
PDDocument document = new PDDocument();
document.addPage(page);
document.save("newDocument.pdf");
document.close();
But when I convert it to HTML, it still keeps the text that was hidden. Is there any way to save the document without the cropped area and convert it to HTML correctly?
Thanks.
Best regards.
