Replace PDF page using PDFBox - java

I have two PDF files (A1.pdf and B1.pdf). I want to programmatically replace some pages of the second file (B1.pdf) with pages from the first one (A1.pdf). I am using the PDFBox library for this.
Here is my sample code:
try {
    File file = new File("/Users/test/Desktop/A1.pdf");
    PDDocument pdDoc = PDDocument.load(file);
    PDDocument document = PDDocument.load(new File("/Users/test/Desktop/B1.pdf"));
    document.removePage(3);
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0));
    document.save("/Users/test/Desktop/"+"generatedPDFBox"+".pdf");
    document.close();
} catch (Exception e) {}
The idea is to replace the 3rd page. With this implementation, the page is appended after the last page of the output PDF instead. Can anyone help me implement this? If it cannot be done with PDFBox, could you please suggest other Java libraries?

This solution creates a third PDF file with the contents you asked for. Note that page indexes are zero-based, so the "3" in your question must be a "2".
PDDocument a1doc = PDDocument.load(file1);
PDDocument b1doc = PDDocument.load(file2);
PDDocument resDoc = new PDDocument();
List<PDPage> a1Pages = a1doc.getDocumentCatalog().getAllPages();
List<PDPage> b1Pages = b1doc.getDocumentCatalog().getAllPages();
// replace the 3rd page of the 2nd file with the 1st page of the 1st one
for (int p = 0; p < b1Pages.size(); ++p)
{
    if (p == 2)
        resDoc.addPage(a1Pages.get(0));
    else
        resDoc.addPage(b1Pages.get(p));
}
resDoc.save(file3);
a1doc.close();
b1doc.close();
resDoc.close();
If you want to work from the command line instead, look here:
https://pdfbox.apache.org/commandline/
Then use PDFSplit and PDFMerger.
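The index logic of the loop above can be tried out without any PDF library. The following is only a minimal sketch using plain strings in place of pages; the class name PageReplace and the method replacePage are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

class PageReplace {
    /**
     * Returns a copy of the page list in which the entry at the given
     * zero-based index is swapped for the replacement. The input list
     * is left untouched, just as the answer builds a separate result document.
     */
    static <T> List<T> replacePage(List<T> pages, int zeroBasedIndex, T replacement) {
        if (zeroBasedIndex < 0 || zeroBasedIndex >= pages.size()) {
            throw new IndexOutOfBoundsException("No page at index " + zeroBasedIndex);
        }
        List<T> result = new ArrayList<>(pages);
        result.set(zeroBasedIndex, replacement);
        return result;
    }

    public static void main(String[] args) {
        List<String> b1 = List.of("B-1", "B-2", "B-3", "B-4");
        // The 3rd page has index 2 because indexes start at 0.
        System.out.println(replacePage(b1, 2, "A-1")); // prints [B-1, B-2, A-1, B-4]
    }
}
```

The replaced page keeps its position and the page count stays the same, which is the behaviour the question asks for.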

I am not too familiar with how PDFBox works, but to answer your follow-up question: you can accomplish this fairly simply with the Datalogics APDFL SDK. A free trial exists in case you want to look into it. Here is a code snippet so you can see how it would be done:
Document Doc1 = new Document("/Users/test/Desktop/A1.pdf");
Document Doc2 = new Document("/Users/test/Desktop/B1.pdf");
/* Delete pages on the page range 3-3 */
Doc2.deletePages(3, 3);
/* LastPage is where in Doc2 you want to insert the page,
   Doc1 is the document the page comes from,
   0 is the page number in Doc1 that will be inserted first,
   1 is the number of pages that will be inserted (beginning from the
   page number specified in the previous parameter), and
   PageInsertFlags lets you customize what does and does not get copied */
Doc2.insertPages(Document.LastPage, Doc1, 0, 1, PageInsertFlags.All);
Doc2.save(EnumSet.of(SaveFlags.FULL), "out.pdf");
Alternatively, there is another method called replacePages which makes the deletion unnecessary. It all depends on what your end goal is, of course.

Related

How to get content of a page of a .docx file using Apache Poi?

I'm trying to read .docx files with styling information using Apache POI. I have done this by looping through each XWPFParagraph and working with all the XWPFRun objects inside the paragraphs. Now I want to get the contents of each page. Is there a way to get the contents of each page, or is it possible to know which page a paragraph is on?
This is a function that takes the absolute path of a docx file and returns an array of strings:
FileInputStream fis = new FileInputStream(absolutePath);
XWPFDocument document = new XWPFDocument(fis);
List<IBodyElement> bodyElements = document.getBodyElements();
List<String> textList = new ArrayList<>();
/* I want to add some kind of outer loop here for each page
   and at the end of that loop I want to add a "<hr/>" tag in the textList
*/
for (IBodyElement bodyElement : bodyElements) { // Looping through paragraphs
    if (bodyElement.getElementType() == BodyElementType.PARAGRAPH) {
        XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
        String textToAdd = parseParagraph(paragraph); // custom function to handle paragraphs
        textList.add(textToAdd);
    }
}
document.close();
return textList.toArray(new String[0]);
As you can see my goal here is to add a <hr/> tag after each page. So, if somehow I can get the page number of a paragraph or loop through pages, I will be able to do that.
Please kindly mention if you know about any other approach that may help.
To get the page count from an XWPFDocument (for your outer loop), you can do something like this:
XWPFDocument docx = new XWPFDocument(POIXMLDocument.openPackage(YOUR_FILE_PATH));
int numOfPages = docx.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
For your paragraph text:
for (XWPFParagraph p : document.getParagraphs()) {
    System.out.println(p.getParagraphText()); // YOUR PARAGRAPH TEXT
}
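Note that the page count read from the extended properties is a value stored in the file by whichever application last saved it, so it may be missing or out of date, and it does not say which paragraph belongs to which page. One possible workaround, sketched below under assumptions, is to decide per paragraph whether it starts a new page (for example from explicit page-break markup) and to emit the "<hr/>" separator at those points. The class PageSeparators, the holder Para and the flag startsNewPage are hypothetical names for illustration; how the flag is derived from the POI objects is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class PageSeparators {
    /** Hypothetical holder: paragraph text plus whether a page break precedes it. */
    static final class Para {
        final String text;
        final boolean startsNewPage;

        Para(String text, boolean startsNewPage) {
            this.text = text;
            this.startsNewPage = startsNewPage;
        }
    }

    /** Copies the paragraph texts and inserts "<hr/>" wherever a new page starts. */
    static List<String> withPageRules(List<Para> paragraphs) {
        List<String> out = new ArrayList<>();
        for (Para p : paragraphs) {
            // No rule before the very first paragraph, even if it is flagged.
            if (p.startsNewPage && !out.isEmpty()) {
                out.add("<hr/>");
            }
            out.add(p.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Para> paras = List.of(
                new Para("Intro", false),
                new Para("More", false),
                new Para("Chapter 2", true));
        System.out.println(withPageRules(paras)); // prints [Intro, More, <hr/>, Chapter 2]
    }
}
```

This only catches page boundaries that are marked in the document. Pages that break because the text simply overflows are decided by the word processor's layout and are not visible to such a check.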

Deleting Annots from each page in pdf document

I downloaded a PDF file from a site, and each page contains a hyperlink to that site inside a rectangle. I want to remove the link from every page.
I am using PDFBox version 2.0.8.
I figured out that the link description is located in the ANNOTS entry of every page of the document. I deleted the ANNOTS entries corresponding to the link. Of course, I set the needToBeUpdated flag to true on every node in the chain from the PDF catalog.
In debug mode I see that the readOnly flag is set to true in the AccessPermission object.
When I open the edited PDF file, all pages are empty, and for every page Acrobat Reader shows the following error:
There was an error processing a page. Invalid Function resource.
I have several questions:
Can I programmatically change the PDF file when the readOnly flag is set to true in the AccessPermission object?
Why do I get the error described above?
What do I need to do to remove the unnecessary link from each page so that every page displays properly in the PDF document?
Here is my code (sorry for the quality, this is only a draft):
File book = new File(path_to_pdf_file);
PDDocument document = PDDocument.load(book);
document.setAllSecurityToBeRemoved(true);
COSDictionary dictionary = document.getDocumentCatalog().getCOSObject();
dictionary.removeItem(COSName.PERMS);
dictionary.setNeedToBeUpdated(true);
((COSObject) document.getDocumentCatalog().getCOSObject().getItem(COSName.PAGES)).setNeedToBeUpdated(true);
dictionary = document.getDocumentCatalog().getPages().getCOSObject();
dictionary.setNeedToBeUpdated(true);
COSArray arr = (COSArray) dictionary.getDictionaryObject(COSName.KIDS);
arr.setNeedToBeUpdated(true);
COSArray arrayForLoop;
COSDictionary tempDic;
for (int k = 0; k < arr.size(); ++k) {
    COSObject object = (COSObject) arr.get(k);
    object.setNeedToBeUpdated(true);
    dictionary = (COSDictionary) object.getObject();
    dictionary.setNeedToBeUpdated(true);
    arrayForLoop = (COSArray) dictionary.getItem(COSName.ANNOTS);
    arrayForLoop.setNeedToBeUpdated(true);
    arrayForLoop = (COSArray) arrayForLoop.getCOSObject();
    arrayForLoop.setNeedToBeUpdated(true);
    dictionary = (COSDictionary) arrayForLoop.get(0);
    dictionary.setNeedToBeUpdated(true);
    dictionary.removeItem(COSName.TYPE);
    dictionary.removeItem(COSName.SUBTYPE);
    dictionary.removeItem(COSName.RECT);
    dictionary.removeItem(COSName.BORDER);
    tempDic = (COSDictionary) dictionary.getItem(COSName.A);
    tempDic.setNeedToBeUpdated(true);
    dictionary.removeItem(COSName.A);
}
document.saveIncremental(new FileOutputStream(path_to_save_file));
document.close();
In the code above I iterate over every page and delete the ANNOTS entries that correspond to the link. I also used the saveIncremental method to traverse all modified nodes from leaf to root.
Thank you for your answers.

How to save a jsoup document as text file

I am trying to save all of the readable words on a web page into one text document while ignoring the HTML markup.
I am using jsoup to parse the page, and my only guess at how to separate the real words from the code is to go through elements.
Is it possible to convert multiple elements of the jsoup document into a text file?
For example:
Elements titles = doc.select("title");
Elements paragraphs = doc.select("p");
Elements links = doc.select("a[href]");
Elements smallText = doc.select("a");
Currently I load the parsed page as a document with:
Document doc = Jsoup.connect("https:// (enter a url)").get();
Here is a simple way:
Document doc = Jsoup.connect("https:// (enter a url)").get();
BufferedWriter writer = null;
try
{
    writer = new BufferedWriter( new FileWriter("d://test.txt"));
    writer.write(doc.toString());
}
catch ( IOException e)
{
}
Adding answer because I am unable to comment above.
Replace writer.write(doc.toString()); with writer.write(doc.select("html").text()); in the above code.
It will give you the text on the page.
Instead of "html" in doc.select("html").text(), other tags can be used to extract the text enclosed in those tags.
Edit: you can also use writer.write(doc.body().text());
After writing the text with writer.write(doc.text());, call writer.close(); on the very next line. This will fix the problem.
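The closing step can also be left to the language. The following is a minimal sketch, not taken from the answers above, that writes a string to a file with try-with-resources so the writer is flushed and closed even when an exception occurs. The class TextSaver and its methods are made-up names, and the string passed in stands in for whatever doc.body().text() would return.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TextSaver {
    /** Writes the text to the target file; the writer is closed automatically. */
    static void saveText(String text, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        } // close() runs here and flushes the buffered content to disk
    }

    /** Demo helper: saves the text to a temporary file and reads it back. */
    static String roundTrip(String text) {
        try {
            Path target = Files.createTempFile("page-text", ".txt");
            saveText(text, target);
            String readBack = Files.readString(target, StandardCharsets.UTF_8);
            Files.delete(target);
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In the jsoup case, the argument would be the extracted page text.
        System.out.println(roundTrip("Readable words from the page"));
    }
}
```

Without the close (explicit or automatic), a BufferedWriter may still hold the text in its buffer when the program ends, which leaves the file empty or truncated.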

PDFBox 2.0.3: PDDocument Scratch File Already Closed

I'm generating a PDDocument in Java with code like this...
HashMap<Integer, PDPageContentStream> mPageContentStreamMap = new HashMap<>();
PDDocument doc = new PDDocument();
for (int i = 1; i <= mNumPages; i++) {
    PDPage page = new PDPage(PDRectangle.A4);
    page.setRotation(90);
    PDPageContentStream pageContentStream = new PDPageContentStream(doc, page);
    mPageContentStreamMap.put(i, pageContentStream);
    doc.addPage(page);
}
Then later save and close the document like this...
for (int i : mPageContentStreamMap.keySet()) {
    mPageContentStreamMap.get(i).close();
}
doc.save("test-filename");
doc.close();
This works fine on the first run; however, when I run my program multiple times I get the following error:
java.io.IOException: Scratch file already closed
at org.apache.pdfbox.io.ScratchFile.checkClosed(ScratchFile.java:390)
at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:78)
at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:403)
at org.apache.pdfbox.cos.COSStream.createOutputStream(COSStream.java:208)
at org.apache.pdfbox.pdmodel.common.PDStream.createOutputStream(PDStream.java:224)
at org.apache.pdfbox.pdmodel.PDPageContentStream.<init>(PDPageContentStream.java:259)
at org.apache.pdfbox.pdmodel.PDPageContentStream.<init>(PDPageContentStream.java:121)
If I re-run my program without the "doc.close();" line, this error goes away, but the output of the PDF is duplicated (i.e. a new PDF is generated, but with the content from the last PDF and the content from the current PDF).
Is there a way to close the stream and create multiple PDFs without running into the scratch file error?
I had created a singleton object for my drawing logic, which meant that after the first run the same objects (including the already closed document) were reused when they shouldn't have been, because the input (what was being drawn) had changed.
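The effect can be reproduced without PDFBox. The sketch below is only an illustration under assumptions: FakeDocument is a made-up stand-in for a closeable document, and ReportRun.render is a hypothetical drawing routine that creates all of its state per call instead of keeping it in a shared singleton.

```java
import java.util.ArrayList;
import java.util.List;

class ReportRun {
    /** Made-up stand-in for a document that cannot be used after close(). */
    static final class FakeDocument implements AutoCloseable {
        private final List<String> pages = new ArrayList<>();
        private boolean closed;

        void addPage(String content) {
            if (closed) {
                throw new IllegalStateException("document already closed");
            }
            pages.add(content);
        }

        int pageCount() {
            return pages.size();
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Every call builds a fresh document, so nothing leaks from an earlier run. */
    static int render(List<String> contents) {
        try (FakeDocument doc = new FakeDocument()) {
            for (String content : contents) {
                doc.addPage(content);
            }
            return doc.pageCount();
        }
    }

    public static void main(String[] args) {
        System.out.println(render(List.of("page A", "page B"))); // prints 2
        System.out.println(render(List.of("page C")));           // prints 1
    }
}
```

If the document were a field of a long-lived singleton instead, the second run would either hit the closed document (the analogue of the scratch file error) or, with the close removed, keep the pages from the first run, which matches the duplicated output described in the question.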

Convert HTML to PDF and add it to a paragraph

I want to add a paragraph, containing HTML, to a document.
As far as I know, iText only supports adding HTML to a document directly via XMLWorkerHelper.
Furthermore I want to change the font of the HTML, but this can be done with a css-file.
My approach is similar to this code:
XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
worker.parseXHtml(pdfWriter, document, fis);
But this solution is writing to the document directly. I want to add the HTML to a paragraph, so I may add some additional formatting to that section.
String html = "<p>Html code here</p>";
Paragraph comb = new Paragraph();
StringBuilder sb = new StringBuilder();
sb.append(html);
ElementList list = XMLWorkerHelper.parseToElementList(sb.toString(), null);
for (Element element : list) {
    comb.add(element);
}
para = new Paragraph(comb);
cell = new PdfPCell(para);
cell.setHorizontalAlignment(Element.ALIGN_LEFT);
cell.setBorder(Rectangle.NO_BORDER);
cell.setPaddingTop(0);
cell.setPaddingBottom(15f);
cell.setLeading(3f, 1.2f);
table.addCell(cell);
Have a look at the "parsing HTML step by step" example. In that example, the final pipeline is a PdfWriterPipeline, which isn't what you want (because this pipeline writes content to the document). You want to replace this final pipeline with an ElementHandlerPipeline, which converts all the HTML tags that are encountered into an ElementList.
Once you have this list of Element instances, it's up to you to decide what to do with it (adding them to a Paragraph is one option).
