How to read Images in the .doc file

How can I read images from an MS Office .doc file using Apache POI? I tried the following code, but it does not work.
try {
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\ImageDocument.doc"));
Document document = new Document();
OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/ImageDocumentPDF.pdf"));
PdfWriter.getInstance(document, fileOutput);
document.open();
HWPFDocument hdocument=new HWPFDocument(fs);
Range range=hdocument.getOverallRange();
PdfPTable createTable;
CharacterRun run;
PicturesTable picture=hdocument.getPicturesTable();
int picoffset=run.getPicOffset();
for(int i=0;i<range.numParagraphs();i++) {
run =range.getCharacterRun(i);
if(picture.hasPicture(run)) {
Picture pic=picture.extractPicture(run, true);
byte[] picturearray=pic.getContent();
com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
document.add(image);
}
}
}
When I execute the above code and print the picture offset, it displays -1,
and when I print picture.hasPicture(run), it returns false even though the input file contains an image.
Please help me find a solution.
Thank you.

public static List<byte[]> extractImagesFromWord(File file) {
if (file.exists()) {
try {
List<byte[]> result = new ArrayList<byte[]>();
if ("docx".equals(getMimeType(file).getExtension())) {
org.apache.poi.xwpf.usermodel.XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
for (org.apache.poi.xwpf.usermodel.XWPFPictureData picture : doc.getAllPictures()) {
result.add(picture.getData());
}
} else if ("doc".equals(getMimeType(file).getExtension())) {
org.apache.poi.hwpf.HWPFDocument doc = new HWPFDocument(new FileInputStream(file));
for (org.apache.poi.hwpf.usermodel.Picture picture : doc.getPicturesTable().getAllPictures()) {
result.add(picture.getContent());
}
}
return result;
} catch (Exception e) {
throw new RuntimeException( e);
}
}
return null;
}

This worked for me. If picOffset returns -1, it means there is no image for the current CharacterRun.
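The snippet in this answer relies on a getMimeType(file).getExtension() helper that is not shown. As a minimal stand-in, assuming the file name alone is a good enough hint, the extension can be derived with the standard library only. The class and method names below (FileExtensions, extensionOf) are hypothetical and not part of Apache POI.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper: derives a lower-cased extension from the file name only.
// It does not inspect the file content, so a renamed file will be misclassified.
final class FileExtensions {
    private FileExtensions() {}

    /** Returns the extension without the dot, lower-cased, or "" if there is none. */
    static String extensionOf(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        // dot <= 0 covers "no dot at all" and dot-files such as ".profile";
        // a trailing dot ("report.") also counts as "no extension".
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf(new File("ImageDocument.doc")));
    }
}
```

With such a helper, the two branches above could compare "docx" and "doc" against extensionOf(file) instead of the unshown getMimeType call.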

Related

Unable to read PDF using itextpdf

I am using the itextpdf-5.5.4 jar to merge two PDFs into one PDF.
I do not get any error or exception while running the code, but the merged PDF displays the text below. I do not see this text when I open the individual PDFs.
The document you are trying to load requires Adobe Reader 8 or higher.
You may not have the Adobe Reader installed or your viewing
environment may not be properly configured to use Adobe Reader.
For information on how to install Adobe Reader and configure your
viewing environment please see
http://www.adobe.com/go/pdf_forms_configure.
void mergePdfFiles(List<InputStream> inputPdfList, OutputStream outputStream) throws Exception {
// Create document and pdfReader objects.
Document document = new Document();
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
// Create pdf Iterator object using inputPdfList.
Iterator<InputStream> pdfIterator = inputPdfList.iterator();
// Create reader list for the input pdf files.
while (pdfIterator.hasNext()) {
InputStream pdf = pdfIterator.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages = totalPages + pdfReader.getNumberOfPages();
}
// Create writer for the outputStream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
// Open document.
document.open();
// Contain the pdf data.
PdfContentByte pageContentByte = writer.getDirectContent();
PdfImportedPage pdfImportedPage;
int currentPdfReaderPage = 1;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Iterate and process the reader list.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
// Create page and add content.
while (currentPdfReaderPage <= pdfReader.getNumberOfPages()) {
document.newPage();
pdfImportedPage = writer.getImportedPage(pdfReader, currentPdfReaderPage);
pageContentByte.addTemplate(pdfImportedPage, 0, 0);
currentPdfReaderPage++;
}
currentPdfReaderPage = 1;
}
// Close document and outputStream.
outputStream.flush();
document.close();
outputStream.close();
System.out.println("Pdf files merged successfully.");
}
public static void main(String args[]) {
try {
List<InputStream> inputPdfList = new ArrayList<InputStream>();
inputPdfList.add(new FileInputStream("pdf1.pdf"));
inputPdfList.add(new FileInputStream("pdf2.pdf"));
OutputStream outputStream = new FileOutputStream("Merge-PDF.pdf");
ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
mergePdfFiles(inputPdfList, byteStream);
byte[] byteS = byteStream.toByteArray();
outputStream.write(byteS);
} catch (Exception e) {
e.printStackTrace();
}
}
Please help me out on this.

How can I merge documents consisting of PDFs as well as images?

In my current code I am merging documents that consist of PDF files.
public static void appApplicantDownload(File file) {
Connection con = getConnection();
Scanner sc = new Scanner(System.in);
List < InputStream > list = new ArrayList < InputStream > ();
try {
OutputStream fOriginal = new FileOutputStream(file, true); // original
list.add(new FileInputStream(file1));
list.add(new FileInputStream(file2));
list.add(new FileInputStream(file3));
doMerge(list, fOriginal);
} catch (Exception e) {
}
}
public static void doMerge(List < InputStream > list, OutputStream outputStream) throws DocumentException, IOException {
try {
System.out.println("Merging...");
Document document = new Document();
PdfCopy copy = new PdfCopy(document, outputStream);
document.open();
for (InputStream in : list) {
ByteArrayOutputStream b = new ByteArrayOutputStream();
IOUtils.copy( in , b);
PdfReader reader = new PdfReader(b.toByteArray());
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
copy.addPage(copy.getImportedPage(reader, i));
}
}
outputStream.flush();
document.close();
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
}
}
But now I want to change the code so that it can merge images as well as PDFs. The above code gives me the error "No PDF Signature found".

First, you have to know whether each file is a PDF or an image. The easiest way is to use the file extension: get the extension of each file and pass that information to your doMerge method. To achieve that, I would change your current method
public static void doMerge(List<InputStream> list, OutputStream outputStream)
to something like
public static void doMerge(Map<InputStream, String> files, OutputStream outputStream)
So each InputStream is associated with an extension.
Second, you have to load images and PDFs separately. So your loop should look like
for (InputStream in : files.keySet()) {
String ext = files.get(in);
if(ext.equalsIgnoreCase("pdf"))
{
//load pdf here
}
else if(ext.equalsIgnoreCase("png") || ext.equalsIgnoreCase("jpg"))
{
//load image here
}
}
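One caveat with the map-based signature: iterating keySet() on a plain HashMap gives no guaranteed order, so the merged pages could come out shuffled. A minimal sketch, assuming the caller wants files merged in the order they were added, is to pass a LinkedHashMap. The class and method names here (MergeOrder, extensionsInMergeOrder) are hypothetical and only illustrate the iteration order.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: LinkedHashMap iterates in insertion order,
// which keeps the merge order stable.
final class MergeOrder {
    private MergeOrder() {}

    /** Returns the extensions in the order the map iterates its entries. */
    static List<String> extensionsInMergeOrder(Map<InputStream, String> files) {
        List<String> order = new ArrayList<>();
        for (Map.Entry<InputStream, String> entry : files.entrySet()) {
            order.add(entry.getValue());
        }
        return order;
    }

    public static void main(String[] args) {
        Map<InputStream, String> files = new LinkedHashMap<>();
        // Empty streams stand in for real file contents.
        files.put(new ByteArrayInputStream(new byte[0]), "pdf");
        files.put(new ByteArrayInputStream(new byte[0]), "png");
        files.put(new ByteArrayInputStream(new byte[0]), "pdf");
        System.out.println(extensionsInMergeOrder(files));
    }
}
```

Iterating entrySet() also avoids the extra files.get(in) lookup for every key.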
In Java, you can easily load an image using ImageIO. See this question for more details on the topic: Load image from a filepath via BufferedImage
Then, to add your image to the PDF, use the PdfWriter:
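PdfWriter pw = PdfWriter.getInstance(doc, outputStream);
// Image.getInstance accepts a byte[], so read the stream fully first (IOUtils is from Commons IO, as in the question)
Image img = Image.getInstance(IOUtils.toByteArray(inputStream));
doc.add(img);
doc.newPage();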
If you would rather convert each image to a PDF first and merge afterwards, you can do that too; you just have to use the PdfWriter to write them all first.
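As an alternative to trusting the extension, the file type can be guessed from the first bytes of the content. The "No PDF Signature found" error in the question is what you get when image bytes are handed to PdfReader, because a PDF is expected to begin with "%PDF-". The sketch below uses only the standard library and checks the well-known PDF, PNG, and JPEG signatures; the class and enum names (FileKindSniffer, Kind) are hypothetical.

```java
// Hypothetical helper: classifies a byte array by its leading signature bytes.
final class FileKindSniffer {
    enum Kind { PDF, PNG, JPEG, UNKNOWN }

    private FileKindSniffer() {}

    static Kind detect(byte[] data) {
        // "%PDF-"
        if (startsWith(data, 0x25, 0x50, 0x44, 0x46, 0x2D)) return Kind.PDF;
        // PNG signature: 0x89 'P' 'N' 'G' CR LF 0x1A LF
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return Kind.PNG;
        // JPEG start-of-image marker followed by the next marker prefix
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) return Kind.JPEG;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }
}
```

Since doMerge in the question already copies each stream into a byte array before creating the PdfReader, detect(b.toByteArray()) could decide which branch to take without changing the method signature.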

Create a new page in the opened PDF document and use the function for inserting images to place the image on that page.

Setting Password Protection on XSSF Workbook

I would like to add password protection to an XLSX file created with POI 3.14.
The documentation claims that this is possible:
http://poi.apache.org/encryption.html
Using the example I tried it like this:
public static void main(String[] args)
{
try(Workbook wb = new XSSFWorkbook())
{
//<...>
try(ByteArrayOutputStream baos = new ByteArrayOutputStream())
{
wb.write(baos);
byte[] res = baos.toByteArray();
try(ByteArrayInputStream bais = new ByteArrayInputStream(res))
{
try(POIFSFileSystem fileSystem = new POIFSFileSystem(bais);) // Exception happens here
{
EncryptionInfo info = new EncryptionInfo(EncryptionMode.agile);
Encryptor enc = info.getEncryptor();
enc.confirmPassword("pass");
OutputStream encryptedDS = enc.getDataStream(fileSystem);
OPCPackage opc = OPCPackage.open(new File("example.xlsx"), PackageAccess.READ_WRITE);
opc.save(encryptedDS);
opc.close();
}
}
}
}
catch(Exception e)
{
e.printStackTrace();
}
}
Unfortunately, the code in the example is not compatible with XLSX files, and as a result I receive the following exception:
The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
Can anybody help please? I am unable to find the correct alternative for XLSX...
Thank you all for your help. Here is my working result:
public static void main(String[] args)
{
try(Workbook wb = new XSSFWorkbook())
{
Sheet sheet = wb.createSheet();
Row r = sheet.createRow(0);
Cell cell = r.createCell(0);
cell.setCellType(Cell.CELL_TYPE_STRING);
cell.setCellValue("Test");
try(POIFSFileSystem fileSystem = new POIFSFileSystem();)
{
EncryptionInfo info = new EncryptionInfo(EncryptionMode.standard);
Encryptor enc = info.getEncryptor();
enc.confirmPassword("pass");
OutputStream encryptedDS = enc.getDataStream(fileSystem);
wb.write(encryptedDS);
FileOutputStream fos = new FileOutputStream("C:/example.xlsx");
fileSystem.writeFilesystem(fos);
fos.close();
}
}
catch(Exception e)
{
e.printStackTrace();
}
}

You've misread the documentation on encrypting an OOXML file. You're therefore trying to load your file using the wrong code, when you just need to save it.
Without any error handling, your code basically wants to be:
// Prepare
POIFSFileSystem fs = new POIFSFileSystem();
EncryptionInfo info = new EncryptionInfo(EncryptionMode.agile, CipherAlgorithm.aes192, HashAlgorithm.sha384, -1, -1, null);
Encryptor enc = info.getEncryptor();
enc.confirmPassword("foobaa");
// Create the normal workbook
Workbook wb = new XSSFWorkbook();
Sheet s = wb.createSheet();
// TODO Populate
// Encrypt
OutputStream os = enc.getDataStream(fs);
wb.write(os);
// Close the encrypted stream before writing out the filesystem
os.close();
// Save
FileOutputStream fos = new FileOutputStream("protected.xlsx");
fs.writeFilesystem(fos);
fos.close();
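The exception in the question comes from the container format: an unencrypted .xlsx is a ZIP package, while POIFSFileSystem reads OLE2 containers, which is also the format a password-protected .xlsx is stored in. A minimal sketch, using only the standard library, that tells the two apart by their leading bytes; the class and enum names (OfficeContainerSniffer, Container) are hypothetical and not part of POI.

```java
// Hypothetical helper: distinguishes OLE2 compound files from ZIP packages
// by the first bytes of the file.
final class OfficeContainerSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file header signature.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature: 'P' 'K' 0x03 0x04.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    private OfficeContainerSniffer() {}

    static Container detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

Under this sketch, the bytes produced by wb.write(baos) in the question would be reported as ZIP, which is why wrapping them in a POIFSFileSystem fails, whereas the file written by fs.writeFilesystem(fos) would be reported as OLE2.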

How do you extract color profiles from a PDF file using pdfbox (or other open source Java lib)

Once you've loaded a document:
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
How do you get the page by page printing color intent from the PDDocument? I read the docs, didn't see coverage.

This gets the output intents (you'll find these in high-quality PDF files) and also the ICC profiles for color spaces and images:
PDDocument doc = PDDocument.load(new File("XXXXX.pdf"));
for (PDOutputIntent oi : doc.getDocumentCatalog().getOutputIntents())
{
COSStream destOutputIntent = oi.getDestOutputIntent();
String info = oi.getOutputCondition();
if (info == null || info.isEmpty())
{
info = oi.getInfo();
}
InputStream is = destOutputIntent.createInputStream();
FileOutputStream fos = new FileOutputStream(info + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
for (int p = 0; p < doc.getNumberOfPages(); ++p)
{
PDPage page = doc.getPage(p);
for (COSName name : page.getResources().getColorSpaceNames())
{
PDColorSpace cs = page.getResources().getColorSpace(name);
if (cs instanceof PDICCBased)
{
PDICCBased iccCS = (PDICCBased) cs;
InputStream is = iccCS.getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
for (COSName name : page.getResources().getXObjectNames())
{
PDXObject x = page.getResources().getXObject(name);
if (x instanceof PDImageXObject)
{
PDImageXObject img = (PDImageXObject) x;
if (img.getColorSpace() instanceof PDICCBased)
{
InputStream is = ((PDICCBased) img.getColorSpace()).getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
}
}
doc.close();
What this doesn't do (but I could add some of it if needed):
- colorspaces of shadings, patterns, xobject forms, appearance stream resources
- recursion in colorspaces like DeviceN and Separation
- recursion in patterns, xobject forms, soft masks
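Once the .icc files have been written, a quick sanity check is possible with the JDK's own java.awt.color.ICC_Profile class, without PDFBox. The sketch below is a hedged illustration: the class name IccProfileInspector and its methods are hypothetical, and the main method uses the JDK's built-in sRGB profile as stand-in data rather than a profile taken from a PDF.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;

// Hypothetical helper: summarizes an ICC profile given its raw bytes.
final class IccProfileInspector {
    private IccProfileInspector() {}

    /** Parses the bytes as an ICC profile; throws IllegalArgumentException if they are not valid ICC data. */
    static String describe(byte[] iccBytes) {
        ICC_Profile profile = ICC_Profile.getInstance(iccBytes);
        return "components=" + profile.getNumComponents()
                + ", type=" + typeName(profile.getColorSpaceType());
    }

    private static String typeName(int colorSpaceType) {
        switch (colorSpaceType) {
            case ColorSpace.TYPE_RGB:  return "RGB";
            case ColorSpace.TYPE_CMYK: return "CMYK";
            case ColorSpace.TYPE_GRAY: return "GRAY";
            default:                   return "other(" + colorSpaceType + ")";
        }
    }

    public static void main(String[] args) {
        // Stand-in data: the JDK's built-in sRGB profile, not one extracted from a PDF.
        byte[] srgb = ICC_Profile.getInstance(ColorSpace.CS_sRGB).getData();
        System.out.println(describe(srgb));
    }
}
```

For an extracted file, the bytes could be read with java.nio.file.Files.readAllBytes and passed to describe in the same way.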

I read the examples on "How to create/add Intents to a PDF file". I couldn't get an example on "How to get intents". Using the API/examples, I wrote the following (untested code) to get the COSStream object for each of the Intents. See if this is useful for you.
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
for (PDOutputIntent e : list) {
p("PDOutputIntent Found:");
p("Info="+e.getInfo());
p("OutputCondition="+e.getOutputCondition());
p("OutputConditionIdentifier="+e.getOutputConditionIdentifier());
p("RegistryName="+e.getRegistryName());
COSStream cstr = e.getDestOutputIntent();
}
doc.close();
}
static void p(String s) {
System.out.println(s);
}

Using the iText PDF library (a fork of the older version 4.2.1), you could do something like:
PdfReader reader = new com.lowagie.text.pdf.PdfReader(pathToPdf.toString()); // pathToPdf: a java.nio.file.Path to the PDF
PRStream stream = (PRStream) reader.getCatalog().getAsDict(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
For extracting the profile from each page you could iterate over each page like
for(int i = 1; i <= reader.getNumberOfPages(); i++)
{
PRStream prStream = (PRStream) reader.getPageN(i).getDirectObject(PdfName.DESTOUTPUTPROFILE);
if (prStream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(prStream);
}
}

I don't know whether this code will help or not. After looking through the links below,
How do I add an ICC to an existing PDF document
PdfBox - PDColorSpaceFactory.createColorSpace(document, iccColorSpace) throws nullpointerexception
https://pdfbox.apache.org/docs/1.8.11/javadocs/org/apache/pdfbox/pdmodel/graphics/color/PDICCBased.html
I found some code; check whether it helps or not:
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
PDDocumentCatalog cat = doc.getDocumentCatalog();
COSArray cosArray = doc.getCOSObject();
PDICCBased pdCS = new PDICCBased( cosArray );
pdCS.getNumberOfComponents()
static void p(String s) {
System.out.println(s);
}
}

Retrieving content of hyperlinked slides in powerpoint files(.PPTX) through apache POI

I am trying to get the text content of PowerPoint files and replace it with other text. I have a PowerPoint file with 20 slides, where slides 13, 14, 15, and 16 have hyperlinks to slides 17, 18, 19, and 20. I am using XMLSlideShow to traverse the slides, but it returns only 16 slides. It does not return the last 4 hyperlinked slides.
Any idea on how I can get the content of all hyperlinked slides and replace it with other text would be much appreciated.
Here is my code.
public static void replaceContentInPPTX(File inputFile, File outputFile) throws IOException{
FileInputStream fis = null;
FileOutputStream fos = null;
XMLSlideShow ppt = null;
try{
fis = new FileInputStream(inputFile);
fos = new FileOutputStream(outputFile);
ppt = new XMLSlideShow(fis);
// System.out.println("Available slide layouts:"+ppt.getSlideMasters().length);
/* for(XSLFSlideMaster master : ppt.getSlideMasters()){
XSLFShape[] shape = master.getShapes();
for(XSLFSlideLayout layout : master.getSlideLayouts()){
System.out.println(layout.getType());
}
}*/
System.out.println("No of slides:"+ppt.getSlides().length); // gives 16 slides.
for(XSLFSlide slide : ppt.getSlides()) {
for(XSLFShape shape : slide){
if(shape instanceof XSLFTextShape) {
XSLFTextShape txShape = (XSLFTextShape)shape;
for (XSLFTextParagraph xslfParagraph : txShape.getTextParagraphs()) {
String originalText = replaceUnwantedChar(xslfParagraph.getText());
if(! originalText.isEmpty()) {
String translation = "";
if(translation != null ) {
CTRegularTextRun[] ctRegularTextRun = xslfParagraph.getXmlObject().getRArray();
for(int index = ctRegularTextRun.length-1; index > 0 ; index--){
xslfParagraph.getXmlObject().removeR(index);
}
ctRegularTextRun[0].setT(translation);
}
}
}
}
}
}
ppt.write(fos);
fos.close();
fis.close();
}catch(Exception ex){
ex.printStackTrace();
}
}
