How can I merge the documents consisting PDFs as well as images? - java

In my current code I am merging the documents consisting PDF files.
public static void appApplicantDownload(File file) {
Connection con = getConnection();
Scanner sc = new Scanner(System.in);
List < InputStream > list = new ArrayList < InputStream > ();
try {
OutputStream fOriginal = new FileOutputStream(file, true); // original
list.add(new FileInputStream(file1));
list.add(new FileInputStream(file2));
list.add(new FileInputStream(file3));
doMerge(list, fOriginal);
} catch (Exception e) {
}
}
public static void doMerge(List < InputStream > list, OutputStream outputStream) throws DocumentException, IOException {
try {
System.out.println("Merging...");
Document document = new Document();
PdfCopy copy = new PdfCopy(document, outputStream);
document.open();
for (InputStream in : list) {
ByteArrayOutputStream b = new ByteArrayOutputStream();
IOUtils.copy( in , b);
PdfReader reader = new PdfReader(b.toByteArray());
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
copy.addPage(copy.getImportedPage(reader, i));
}
}
outputStream.flush();
document.close();
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
}
}
But now I want to change the code such that it should allow Image as well as PDFs to be merged. Above code is giving me the error that No PDF Signature found

First, you have to know somehow if the file is a PDF or an image. The easiest way would be to use the extension of the file. So you would have get extension of the file and then pass this information to your doMerge method. Do achieve that, I would change your current method
public static void doMerge(List<InputStream> list, OutputStream outputStream)
For something like
public static void doMerge(Map<InputStream, String> files, OutputStream outputStream)
So each InputStream is associated with a extension.
Second, you have to load images and pdf seperately. So your loop should look like
for (InputStream in : files.keySet()) {
String ext = files.get(in);
if(ext.equalsIgnoreCase("pdf"))
{
//load pdf here
}
else if(ext.equalsIgnoreCase("png") || ext.equalsIgnoreCase("jpg"))
{
//load image here
}
}
In java, you can easily load an Image using ImageIO. Look at this question for more details on this topic: Load image from a filepath via BufferedImage
Then to add your image into the PDF, use the PdfWriter
PdfWriter pw = PdfWriter.GetInstance(doc, outputStream);
Image img = Image.GetInstance(inputStream);
doc.Add(img);
doc.NewPage();
If you want to convert your image into PDF before and merge after you also can do that but you just have to use the PdfWriter to write them all first.

Create a new page in the opened PDF document and use the function for inserting images to place the image on that page.

Related

Unable to read PDF using itextpdf

I am using itextpdf-5.5.4 jar to merge or add two pdf into one PDF.
I did not get any error or exception while running code but displayed below text in Merged PDF. I did not get below text when i open individual PDF's.
The document you are trying to load requires Adobe Reader 8 or higher.
You may not have the Adobe Reader installed or your viewing
environment may not be properly configured to use Adobe Reader.
For information on how to install Adobe Reader and configure your
viewing environment please see
http://www.adobe.com/go/pdf_forms_configure.
void mergePdfFiles(List<InputStream> inputPdfList, OutputStream outputStream) throws Exception {
// Create document and pdfReader objects.
Document document = new Document();
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
// Create pdf Iterator object using inputPdfList.
Iterator<InputStream> pdfIterator = inputPdfList.iterator();
// Create reader list for the input pdf files.
while (pdfIterator.hasNext()) {
InputStream pdf = pdfIterator.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages = totalPages + pdfReader.getNumberOfPages();
}
// Create writer for the outputStream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
// Open document.
document.open();
// Contain the pdf data.
PdfContentByte pageContentByte = writer.getDirectContent();
PdfImportedPage pdfImportedPage;
int currentPdfReaderPage = 1;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Iterate and process the reader list.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
// Create page and add content.
while (currentPdfReaderPage <= pdfReader.getNumberOfPages()) {
document.newPage();
pdfImportedPage = writer.getImportedPage(pdfReader, currentPdfReaderPage);
pageContentByte.addTemplate(pdfImportedPage, 0, 0);
currentPdfReaderPage++;
}
currentPdfReaderPage = 1;
}
// Close document and outputStream.
outputStream.flush();
document.close();
outputStream.close();
System.out.println("Pdf files merged successfully.");
}
public static void main(String args[]) {
try {
List<InputStream> inputPdfList = new ArrayList<InputStream>();
inputPdfList.add(new FileInputStream("pdf1.pdf"));
inputPdfList.add(new FileInputStream("pdf2.pdf"));
OutputStream outputStream = new FileOutputStream("Merge-PDF.pdf");
ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
mergePdfFiles(inputPdfList, byteStream);
byte[] byteS = byteStream.toByteArray();
outputStream.write(byteS);
} catch (Exception e) {
e.printStackTrace();
}
}
Please help me out on this.

Merging PDFs with iText 2.1.7 is causing them to shrink

I am using iText 2.1.7 to merge some document PDFs into a single PDF. The code below seems to work just fine, however it appears in some instances the rendered PDF is scaled slightly smaller, like at 90% of the PDF if printed directly before processing.
Is there a way to keep the current size?
private void doMerge(List<InputStream> list, OutputStream outputStream) throws DocumentException, IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
PdfContentByte cb = writer.getDirectContent();
for (InputStream in : list) {
PdfReader reader = new PdfReader(in);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
document.newPage();
//import the page from source pdf
PdfImportedPage page = writer.getImportedPage(reader, i);
//add the page to the destination pdf
cb.addTemplate(page, 0, 0);
}
}
outputStream.flush();
document.close();
outputStream.close();
}
The API was already available with that version. I would even suggest to use PdfSmartCopy instead of PdfCopy (or PdfCopyFields if form fields are inside).
private PdfSmartCopy copier;
public void SomeMainMethod(){
Document finalPdf = new Document();
copier = new PdfSmartCopy(finalPdf, outputstream);
//Start adding pdfs
finalPdf.open();
//add n documents
addDocuments(...);
finalPdf.close();
formCopier.close();
}
public void addDocument(InputStream pdfDocument, int startPage, int endPage){
PdfReader reader= new PdfReader(pdfDocument);
int startPage = 1;
int endPage = reader.getNumberOfPages();
for (int i = startPage; i <= endPage; i++) {
copier.addPage(this.copier.getImportedPage(reader,i));
}
if(copier!=null){
//Important: Free Memory!
copier.flush();
copier.freeReader(reader);
}
if (reader!=null) {
reader.close();
reader=null;
}
pdfDocument.close();
}

Read data from image in PDF

I am using iText java TextExtraction to read text from PDF file. I use below code and it works fine for PDF in English Now I have PDF containing data as image. I want to read data from that image
public class pdfreader {
public static void main(String[] args) throws IOException, DocumentException, TransformerException {
String SRC = "";
String DEST = "";
for (String s : args) {
SRC = args[0];
DEST = args[1];
}
File file = new File(DEST);
file.getParentFile().mkdirs();
new pdfreader().readText(SRC, DEST);
}
public void readText(String src, String dest) throws IOException, DocumentException, TransformerException {
try {
PdfReader pdfReader = new PdfReader(src);
PdfReaderContentParser PdfParser = new PdfReaderContentParser(
pdfReader);
PrintWriter out = new PrintWriter(new FileOutputStream(
dest));
TextExtractionStrategy textStrategy;
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
textStrategy = PdfParser.processContent(i,
new SimpleTextExtractionStrategy());
out.println(textStrategy.getResultantText());
}
out.flush();
out.close();
pdfReader.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
You can implement an OCR workflow with iText. As Amedee already hinted, this is something we have tried at iText, with very promising results.
The algorithm (high level):
Implement IEventListener to parse pages of your document
Look out for ImageRenderInfo events, they are fired when the PDF parser hits an image
You can call getImage() on the event and ultimately get a BufferedImage
Feed the BufferedImage to Tesseract
Apply the coordinate transform (tesseract does not use the same coordinate space as iText)
Now that you have the texf in the image, and the location, you can use iText to overlay text on your PDF. Or simply extract it.
iText doesn't support OCR to extract text from images. Try to use Tesseract or something else.

How do you extract color profiles from a PDF file using pdfbox (or other open source Java lib)

Once you've loaded a document:
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
How do you get the page by page printing color intent from the PDDocument? I read the docs, didn't see coverage.
This gets the output intents (you'll get these with high quality PDF files) and also the icc profiles for colorspaces and images:
PDDocument doc = PDDocument.load(new File("XXXXX.pdf"));
for (PDOutputIntent oi : doc.getDocumentCatalog().getOutputIntents())
{
COSStream destOutputIntent = oi.getDestOutputIntent();
String info = oi.getOutputCondition();
if (info == null || info.isEmpty())
{
info = oi.getInfo();
}
InputStream is = destOutputIntent.createInputStream();
FileOutputStream fos = new FileOutputStream(info + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
for (int p = 0; p < doc.getNumberOfPages(); ++p)
{
PDPage page = doc.getPage(p);
for (COSName name : page.getResources().getColorSpaceNames())
{
PDColorSpace cs = page.getResources().getColorSpace(name);
if (cs instanceof PDICCBased)
{
PDICCBased iccCS = (PDICCBased) cs;
InputStream is = iccCS.getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
for (COSName name : page.getResources().getXObjectNames())
{
PDXObject x = page.getResources().getXObject(name);
if (x instanceof PDImageXObject)
{
PDImageXObject img = (PDImageXObject) x;
if (img.getColorSpace() instanceof PDICCBased)
{
InputStream is = ((PDICCBased) img.getColorSpace()).getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
}
}
doc.close();
What this doesn't do (but I could add some of it if needed):
colorspaces of shadings, patterns, xobject forms, appearance stream resources
recursion in colorspaces like DeviceN and Separation
recursion in patterns, xobject forms, soft masks
I read the examples on "How to create/add Intents to a PDF file". I couldn't get an example on "How to get intents". Using the API/examples, I wrote the following (untested code) to get the COSStream object for each of the Intents. See if this is useful for you.
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
for (PDOutputIntent e : list) {
p("PDOutputIntent Found:");
p("Info="+e.getInfo());
p("OutputCondition="+e.getOutputCondition());
p("OutputConditionIdentifier="+e.getOutputConditionIdentifier());
p("RegistryName="+e.getRegistryName());
COSStream cstr = e.getDestOutputIntent();
}
static void p(String s) {
System.out.println(s);
}
}
Using itext pdf library (fork of an older version 4.2.1) you could do smth. like:
PdfReader reader = new com.lowagie.text.pdf.PdfReader(Path pathToPdf);
PRStream stream = (PRStream) reader.getCatalog().getAsDict(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
For extracting the profile from each page you could iterate over each page like
for(int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
PRStream prStream = (PRStream) pdfReader.getPageN(i).getDirectObject(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
}
I don't know whether this code help or not, after searching below links,
How do I add an ICC to an existing PDF document
PdfBox - PDColorSpaceFactory.createColorSpace(document, iccColorSpace) throws nullpointerexception
https://pdfbox.apache.org/docs/1.8.11/javadocs/org/apache/pdfbox/pdmodel/graphics/color/PDICCBased.html
I found some code, check whether it help or not,
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
PDDocumentCatalog cat = doc.getDocumentCatalog();
COSArray cosArray = doc.getCOSObject();
PDICCBased pdCS = new PDICCBased( cosArray );
pdCS.getNumberOfComponents()
static void p(String s) {
System.out.println(s);
}
}

How to read Images in the .doc file

How to read images in the ms-office .doc file using Apache poi? I have tried with the following code but it is not working.
try {
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\ImageDocument.doc"));
Document document = new Document();
OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/ImageDocumentPDF.pdf"));
PdfWriter.getInstance(document, fileOutput);
document.open();
HWPFDocument hdocument=new HWPFDocument(fs);
Range range=hdocument.getOverallRange();
PdfPTable createTable;
CharacterRun run;
PicturesTable picture=hdocument.getPicturesTable();
int picoffset=run.getPicOffset();
for(int i=0;i<range.numParagraphs();i++) {
run =range.getCharacterRun(i);
if(picture.hasPicture(run)) {
Picture pic=picture.extractPicture(run, true);
byte[] picturearray=pic.getContent();
com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
document.add(image);
}
}
}
When i execute the above code and prints the picture offset value it displays -1
and when print picture.hasPicture(run) it returns false though the input file has an image.
Please help me to find the solution.
Thank you
public static List<byte[]> extractImagesFromWord(File file) {
if (file.exists()) {
try {
List<byte[]> result = new ArrayList<byte[]>();
if ("docx".equals(getMimeType(file).getExtension())) {
org.apache.poi.xwpf.usermodel.XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
for (org.apache.poi.xwpf.usermodel.XWPFPictureData picture : doc.getAllPictures()) {
result.add(picture.getData());
}
} else if ("doc".equals(getMimeType(file).getExtension())) {
org.apache.poi.hwpf.HWPFDocument doc = new HWPFDocument(new FileInputStream(file));
for (org.apache.poi.hwpf.usermodel.Picture picture : doc.getPicturesTable().getAllPictures()) {
result.add(picture.getContent());
}
}
return result;
} catch (Exception e) {
throw new RuntimeException( e);
}
}
return null;
}
it worked for me, if picOffset returns -1, it means there is no image for current CharacterRun

Categories