Trying Merge document with PDFMergerUtility in pdfbox 2.00 - java

We have a PDF doc of 10 pages. We are in a need to rearrange the pages and split them into 3 or 4 documents. We were using Pdfbox Merge Document with 1.8.xx as like mergePdf.mergeDocuments() it working fine .now pdfbox version 2.0.0 the sequence is going wrong post rearranging merging 10 pages into 3 separate document.I have tried both setuptemp and setupmain. Both are not providing any positive inputs as well.
Pdfbox 1.8 code sample:
PDDocument document = PDDocument.load(new File(sourceFile));
PDFMergerUtility PDFmerger = new PDFMergerUtility();
Splitter splitter = new Splitter();
splitter.setStartPage(fDStartPage);
splitter.setSplitAtPage((fDEndPage));
List<PDDocument> splittedDocuments = splitter.split(document);
PDFmerger.addSource(getInputStream(splittedDocuments.get(0)));
PDFmerger.setDestinationFileName(destinationFile);
PDFmerger.mergeDocuments();
PDFbox 2.0 code sample:
PDFMergerUtility pdfmerger = new PDFMergerUtility();
PDDocument document = PDDocument.load(new File(filename));
pdfmerger.setDestinationFileName(mergedFileName);
Splitter splitter = new Splitter();
splitter.setStartPage(9);
splitter.setSplitAtPage(10);
List<PDDocument> document1 splitter.split(document);
InputStream is = null;
ByteArrayOutputStream out = new ByteArrayOutputStream();
document1.get(0).save(out);
byte[] data = out.toByteArray();
is = new ByteArrayInputStream(data);
pdfmerger.addSource(is);
pdfmerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
document.close();

Related

extracting data by table/header using pdfbox

I am trying to extract data from PDF by header/table. I am not sure if the PDF data is considered to be header or table. I tried to find if there are any metadata in the PDF, but it is null instead.
Here are the PDF examples:
I want to get the lists and amount of each list from the Summary Of Charges Header.
Based on the summary of charges, there are the amount and qty for each charges.
this is my code for the metadata which metadata variable comes out as null:
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if(!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
PDDocumentInformation info = document.getDocumentInformation();
PDDocumentCatalog catalog = document.getDocumentCatalog();
PDMetadata metadata = catalog.getMetadata();
InputStream stream = metadata.createInputStream();
}
document.close();`
and this code basically just give me chunks of texts.
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if(!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
//InputStream stream = metadata.createInputStream();
System.out.print("Text:" + text);
}
document.close();

iText 7 Html to Pdf conversion and linking external file to the generated pdf

I am encountering an issue while merging two PDFs generated out of IText.
I am new to iText7
I am creating one pdf from html and creating another pdf with excel(.xls) as embedded document to pdf.
I want to merge the 2 files.
Basically I want to generate a PDF from html then attach a excel document to it and then output combined html outPutStream from these two pdfs.
Below is the code I am using
ByteArrayOutputStream htmlToPdfContent = new ByteArrayOutputStream();
PdfWriter writer = new PdfWriter(htmlToPdfContent);
PdfDocument pdf = new PdfDocument(writer);
pdf.setTagged();
PageSize pageSize = PageSize.A4.rotate();
pdf.setDefaultPageSize(pageSize);
ConverterProperties properties = new ConverterProperties();
HtmlConverter.convertToPdf(htmlContent, pdf, properties);
FileUtils.cleanDirectory(new File(outputDir));
ByteArrayOutputStream pdfResult = new ByteArrayOutputStream();
PdfWriter writerResult = new PdfWriter(pdfResult);
PdfDocument pdfDocResult = new PdfDocument(writerResult);
PdfReader reader = new PdfReader(new ByteArrayInputStream(htmlToPdfContent.toByteArray()));
PdfDocument pdfDoc = new PdfDocument(reader);
pdfDoc.copyPagesTo(1, pdfDoc.getNumberOfPages(), pdfDocResult);
ByteArrayOutputStream pdfAttach = new ByteArrayOutputStream();
PdfDocument pdfLaunch = new PdfDocument(new PdfWriter(pdfAttach));
Rectangle rect = new Rectangle(36, 700, 100, 100);
byte[] embeddedFileContentBytes = Files.readAllBytes(Paths.get(excelPath));
PdfFileSpec fs = PdfFileSpec.createEmbeddedFileSpec(pdfLaunch, embeddedFileContentBytes, null, "test.xlsx", null, null);
PdfAnnotation attachment = new PdfFileAttachmentAnnotation(rect, fs)
.setContents("Click me");
pdfLaunch.addNewPage().addAnnotation(attachment);
PdfDocument appliedChanges = new PdfDocument(new PdfReader(new ByteArrayInputStream(pdfAttach.toByteArray())));
appliedChanges.copyPagesTo(1, appliedChanges.getNumberOfPages(), pdfDocResult);
try(OutputStream outputStream = new FileOutputStream(dest)) {
pdfResult.writeTo(outputStream);
}
This is throwing exception
13:56:05.724 [main] ERROR com.itextpdf.kernel.pdf.PdfReader - Error occurred while reading cross reference table. Cross reference table will be rebuilt.
com.itextpdf.io.IOException: Error at file pointer 19,272.
at com.itextpdf.io.source.PdfTokenizer.throwError(PdfTokenizer.java:678)
at com.itextpdf.kernel.pdf.PdfReader.readXrefSection(PdfReader.java:801)
at com.itextpdf.kernel.pdf.PdfReader.readXref(PdfReader.java:774)
at com.itextpdf.kernel.pdf.PdfReader.readPdf(PdfReader.java:538)
at com.itextpdf.kernel.pdf.PdfDocument.open(PdfDocument.java:1818)
at com.itextpdf.kernel.pdf.PdfDocument.<init>(PdfDocument.java:238)
at com.itextpdf.kernel.pdf.PdfDocument.<init>(PdfDocument.java:221)
at com.mediaocean.prisma.order.command.infrastructure.pdf.itext.PdfAttachmentLaunch.main(PdfAttachmentLaunch.java:76)
Caused by: com.itextpdf.io.IOException: xref subsection not found.
... 8 common frames omitted
Exception in thread "main" com.itextpdf.kernel.PdfException: Trailer not found.
at com.itextpdf.kernel.pdf.PdfReader.rebuildXref(PdfReader.java:1064)
at com.itextpdf.kernel.pdf.PdfReader.readPdf(PdfReader.java:543)
at com.itextpdf.kernel.pdf.PdfDocument.open(PdfDocument.java:1818)
at com.itextpdf.kernel.pdf.PdfDocument.<init>(PdfDocument.java:238)
at com.itextpdf.kernel.pdf.PdfDocument.<init>(PdfDocument.java:221)
at com.mediaocean.prisma.order.command.infrastructure.pdf.itext.PdfAttachmentLaunch.main(PdfAttachmentLaunch.java:88)
13:56:05.773 [main] ERROR com.itextpdf.kernel.pdf.PdfReader - Error occurred while reading cross reference table. Cross reference table will be rebuilt.
com.itextpdf.io.IOException: PDF startxref not found.
at com.itextpdf.io.source.PdfTokenizer.getStartxref(PdfTokenizer.java:262)
at com.itextpdf.kernel.pdf.PdfReader.readXref(PdfReader.java:753)
at com.itextpdf.kernel.pdf.PdfReader.readPdf(PdfReader.java:538)
at com.itextpdf.kernel.pdf.PdfDocument.open(PdfDocument.java:1818)
at com.itextpdf.kernel.pdf.PdfDocument.<init>(PdfDocument.java:238)
at com.itextpdf.kernel.pdf.PdfDocument.<init>(PdfDocument.java:221)
at com.mediaocean.prisma.order.command.infrastructure.pdf.itext.PdfAttachmentLaunch.main(PdfAttachmentLaunch.java:88)
Please advise. Thanks in advance !!
Concerning revision 2 of your question
You changed your code differently than proposed in my answer to the first revision of your question, you now convert into the formerly unused PdfDocument pdf instead of directly into the ByteArrayOutputStream htmlToPdfContent.
This actually also is a possible fix of the problem identified in that answer. Thus, you don't get an exception here anymore:
PdfReader reader = new PdfReader(new ByteArrayInputStream(htmlToPdfContent.toByteArray()));
PdfDocument pdfDoc = new PdfDocument(reader);
Instead you now get an exception further down the flow, here:
PdfDocument appliedChanges = new PdfDocument(new PdfReader(new ByteArrayInputStream(pdfAttach.toByteArray())));
And the reason is simple, you have not yet closed the PdfDocument pdfLaunch which writes to the ByteArrayOutputStream pdfAttach. But only closing finalizes the PDF in the output stream. Thus, add the close():
ByteArrayOutputStream pdfAttach = new ByteArrayOutputStream();
PdfDocument pdfLaunch = new PdfDocument(new PdfWriter(pdfAttach));
[...]
pdfLaunch.addNewPage().addAnnotation(attachment);
pdfLaunch.close(); //<==== added
PdfDocument appliedChanges = new PdfDocument(new PdfReader(new ByteArrayInputStream(pdfAttach.toByteArray())));
And you actually do the same mistake again, shortly after, you store the contents of the ByteArrayOutputStream pdfResult to outputStream without closing the PdfDocument pdfDocResult which writes to pdfResult. Thus, also add a close call there:
appliedChanges.copyPagesTo(1, appliedChanges.getNumberOfPages(), pdfDocResult);
pdfDocResult.close(); //<==== added
try(OutputStream outputStream = new FileOutputStream(dest)) {
pdfResult.writeTo(outputStream);
}
Concerning revision 1 of your question
You use the ByteArrayOutputStream htmlToPdfContent as target of two distinct PDF generators, the PdfDocument pdf via the PdfWriter writer and the HtmlConverter.convertToPdf call:
ByteArrayOutputStream htmlToPdfContent = new ByteArrayOutputStream();
PdfWriter writer = new PdfWriter(htmlToPdfContent);
PdfDocument pdf = new PdfDocument(writer);
pdf.setTagged();
PageSize pageSize = PageSize.A4.rotate();
pdf.setDefaultPageSize(pageSize);
ConverterProperties properties = new ConverterProperties();
HtmlConverter.convertToPdf(content, htmlToPdfContent, properties);
This makes the content of htmlToPdfContent a hodgepodge of the outputs of both of them, in particular not a valid PDF.
As you don't add any content to pdf, you can safely remove it and reduce the above excerpt to
ByteArrayOutputStream htmlToPdfContent = new ByteArrayOutputStream();
ConverterProperties properties = new ConverterProperties();
HtmlConverter.convertToPdf(content, htmlToPdfContent, properties);

iText converting incomplete Html file content to pdf using java

I am trying to convert html file into pdf using iText lib(4.2.0). But the problem is it's not printing all the html content to pdf, its only partially printing some data. Here is the code to convert html to pdf.
InputStream il = new FileInputStream("/tmp/test.html");
// step 1
Document document = new Document();
// step 2
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("/tmp/pdf.pdf"));
writer.setInitialLeading(12.5f);
// step 3
document.open();
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
// CSS
CSSResolver cssResolver = new StyleAttrCSSResolver();
InputStream csspathtest =new FileInputStream("/tmp/test.css");
CssFile cssfiletest = XMLWorkerHelper.getCSS(csspathtest);
cssResolver.addCss(cssfiletest);
Pipeline<?> pipeline = new CssResolverPipeline(cssResolver,
new HtmlPipeline(htmlContext,
new PdfWriterPipeline(
document, writer)));
XMLWorker worker = new XMLWorker(pipeline, true);
XMLParser p = new XMLParser(worker);
p.parse(il);
// step
document.close();
Here is the sample html file http://codepaste.net/65kmhp

Insert .pdf doc or .png image content into a .docx file using java

How can I insert pdf or png content into a docx file using java?
I've tried using Apache POI API in the following way, but it is not working (it generates some junk doc file):
XWPFDocument doc = new XWPFDocument();
String pdf = "D://capture1.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
TextExtractionStrategy strategy = parser.processContent(i,new SimpleTextExtractionStrategy());
String text = strategy.getResultantText();
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
run.setText(text);
run.addBreak(BreakType.PAGE);
}
FileOutputStream out1 = new FileOutputStream("D://javadomain1.docx");
doc.write(out1);
out1.close();
reader.close();
System.out.println("Document converted successfully");
You should be able to do it with POI, and you can certainly do it using docx4j.
Here's sample code for inserting an image using docx4j.
Note that to "insert a PDF", you need to OLE embed it. That's more difficult, since you need to convert the PDF to a suitable binary OLE object. In docx4j, helper code for doing this is part of the commercial Enterprise edition.

PDFBox bloated PDF file size

Using PDFBox can read Dynamic PDF created by livecycle. The code below reads then writes back the xml file that used to create the dynamic PDF. I bit concerned as the resulting file is quite large start out with 647kb pdf. The new pdf 14000kb. Anybody know how can reduce the size of the new file produced. Can set some type of compression when writing back to pdf file?
PDDocument doc = PDDocument.load("filename");
doc.setAllSecurityToBeRemoved(true);
PDDocumentCatalog docCatalog = doc.getDocumentCatalog();
PDAcroForm form = docCatalog.getAcroForm();
PDXFA xfa = form.getXFA();
COSBase cos = xfa.getCOSObject();
COSStream coss = (COSStream) cos;
InputStream cosin = coss.getUnfilteredStream();
Document document = documentBuilder.parse(cosin);
COSStream cosout = new COSStream(new RandomAccessBuffer());
OutputStream out = cosout.createUnfilteredStream();
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
DOMSource source = new DOMSource(xmlDoc);
StreamResult result = new StreamResult(out);
transformer.transform(source, result);
PDXFA xfaout = new PDXFA(cosout);
form.setXFA(xfaout);
set a filter:
COSStream cosout = new COSStream(new RandomAccessBuffer());
cosout.setFilters(COSName.FLATE_DECODE);
this will set the Flate filter, which is pretty good in most of the cases.

Categories