Retrieving content of hyperlinked slides in powerpoint files(.PPTX) through apache POI

Retrieving content of hyperlinked slides in powerpoint files(.PPTX) through apache POI - java

I am trying to get the text content of powerpoint files and replace with some other text. I have a powerpoint file of 20 slides. where 13,14,15,16 slides have hyperlink to 17,18,19 and 20th slide. I am using XMLSlideshow to traverse through the slides, But it gives only 16 slides. It does not give last 4 hyperlinked slides.
Any idea really appreaciable in advance how can I get content of all hyperlinked slides and Replace by some other text.
here is my code.
public static void replaceContentInPPTX(File inputFile, File outputFile) throws IOException{
FileInputStream fis = null;
FileOutputStream fos = null;
XMLSlideShow ppt = null;
try{
fis = new FileInputStream(inputFile);
fos = new FileOutputStream(outputFile);
ppt = new XMLSlideShow(fis);
// System.out.println("Available slide layouts:"+ppt.getSlideMasters().length);
/* for(XSLFSlideMaster master : ppt.getSlideMasters()){
XSLFShape[] shape = master.getShapes();
for(XSLFSlideLayout layout : master.getSlideLayouts()){
System.out.println(layout.getType());
}
}*/
System.out.println("No of slides:"+ppt.getSlides().length); // gives 16 slides.
for(XSLFSlide slide : ppt.getSlides()) {
for(XSLFShape shape : slide){
if(shape instanceof XSLFTextShape) {
XSLFTextShape txShape = (XSLFTextShape)shape;
for (XSLFTextParagraph xslfParagraph : txShape.getTextParagraphs()) {
String originalText = replaceUnwantedChar(xslfParagraph.getText());
if(! originalText.isEmpty()) {
String translation = "";
if(translation != null ) {
CTRegularTextRun[] ctRegularTextRun = xslfParagraph.getXmlObject().getRArray();
for(int index = ctRegularTextRun.length-1; index > 0 ; index--){
xslfParagraph.getXmlObject().removeR(index);
}
ctRegularTextRun[0].setT(translation);
}
}
}
}
}
}
ppt.write(fos);
fos.close();
fis.close();
}catch(Exception ex){
ex.printStackTrace();
}
}

Related

Convert PDF to JPG2000 file(s)

I recently started working on this project where I need to convert a PDF File into a JPEG2000 file(s) - 1 jp2 file per page -.
The goal was to replace a previous pdf to jpeg converter method we had, in order to reduce the size of the output file(s).
Based on a code I found on the internet, I made the pdftojpeg2000 converter method below, and I've been changing the setEncodingRate parameter value and comparing the results.
I managed to get smaller jpeg2000 output files, but the quality is very poor, compared to the Jpeg ones, specially for colored text or images.
Here is what my orginal pdf file looks like:
When I set setEncodingRate to 0.8 it looks like this:
My output file size is 850Ko, which is even bigger than the Jpeg (around 600Ko) ones, and lower quality.
At 0.1 setEncodingRate, the file size is considerably small, 111 Ko, but basically unreadable.
So basically what I'm trying to get here is smaller output files ( <600K ) with a better quality, And I'm wondering if it is feasible with the Jpeg2000 format.
public class ImageConverter {
public void compressor(String inputFile, String outputFile) throws IOException {
J2KImageWriteParam iwp = new J2KImageWriteParam();
PDDocument document = PDDocument.load(new File (inputFile), MemoryUsageSetting.setupMixed(10485760L));
PDFRenderer pdfRenderer = new PDFRenderer(document);
int nbPages = document.getNumberOfPages();
int pageCounter = 0;
BufferedImage image;
for (PDPage page : document.getPages()) {
if (page.hasContents()) {
image = pdfRenderer.renderImageWithDPI(pageCounter, 300, ImageType.RGB);
if (image == null)
{
System.out.println("If no registered ImageReader claims to be able to read the resulting stream");
}
Iterator writers = ImageIO.getImageWritersByFormatName("JPEG2000");
String name = null;
ImageWriter writer = null;
while (name != "com.sun.media.imageioimpl.plugins.jpeg2000.J2KImageWriter") {
writer = (ImageWriter) writers.next();
name = writer.getClass().getName();
System.out.println(name);
}
File f = new File(outputFile+"_"+pageCounter+".jp2");
long s = System.currentTimeMillis();
ImageOutputStream ios = ImageIO.createImageOutputStream(f);
writer.setOutput(ios);
J2KImageWriteParam param = (J2KImageWriteParam) writer.getDefaultWriteParam();
IIOImage ioimage = new IIOImage(image, null, null);
param.setSOP(true);
param.setWriteCodeStreamOnly(true);
param.setProgressionType("layer");
param.setLossless(true);
param.setCompressionMode(J2KImageWriteParam.MODE_EXPLICIT);
param.setCompressionType("JPEG2000");
param.setCompressionQuality(0.01f);
param.setEncodingRate(1.01);
param.setFilter(J2KImageWriteParam.FILTER_53 );
writer.write(null, ioimage, param);
System.out.println(System.currentTimeMillis() - s);
writer.dispose();
ios.flush();
ios.close();
image.flush();
pageCounter++;
}
}
}
public static void main(String[] args) {
String input = "E:/IMGTEST/mail-DOC0002.pdf";
String output = "E:/IMGTEST/mail-DOC0002/docamail-DOC0002-";
ImageConverter imgcv = new ImageConverter();
try {
imgcv.compressor(input, output);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

How can I merge the documents consisting PDFs as well as images?

In my current code I am merging the documents consisting PDF files.
public static void appApplicantDownload(File file) {
Connection con = getConnection();
Scanner sc = new Scanner(System.in);
List < InputStream > list = new ArrayList < InputStream > ();
try {
OutputStream fOriginal = new FileOutputStream(file, true); // original
list.add(new FileInputStream(file1));
list.add(new FileInputStream(file2));
list.add(new FileInputStream(file3));
doMerge(list, fOriginal);
} catch (Exception e) {
}
}
public static void doMerge(List < InputStream > list, OutputStream outputStream) throws DocumentException, IOException {
try {
System.out.println("Merging...");
Document document = new Document();
PdfCopy copy = new PdfCopy(document, outputStream);
document.open();
for (InputStream in : list) {
ByteArrayOutputStream b = new ByteArrayOutputStream();
IOUtils.copy( in , b);
PdfReader reader = new PdfReader(b.toByteArray());
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
copy.addPage(copy.getImportedPage(reader, i));
}
}
outputStream.flush();
document.close();
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
}
}
But now I want to change the code such that it should allow Image as well as PDFs to be merged. Above code is giving me the error that No PDF Signature found

First, you have to know somehow if the file is a PDF or an image. The easiest way would be to use the extension of the file. So you would have get extension of the file and then pass this information to your doMerge method. Do achieve that, I would change your current method
public static void doMerge(List<InputStream> list, OutputStream outputStream)
For something like
public static void doMerge(Map<InputStream, String> files, OutputStream outputStream)
So each InputStream is associated with a extension.
Second, you have to load images and pdf seperately. So your loop should look like
for (InputStream in : files.keySet()) {
String ext = files.get(in);
if(ext.equalsIgnoreCase("pdf"))
{
//load pdf here
}
else if(ext.equalsIgnoreCase("png") || ext.equalsIgnoreCase("jpg"))
{
//load image here
}
}
In java, you can easily load an Image using ImageIO. Look at this question for more details on this topic: Load image from a filepath via BufferedImage
Then to add your image into the PDF, use the PdfWriter
PdfWriter pw = PdfWriter.GetInstance(doc, outputStream);
Image img = Image.GetInstance(inputStream);
doc.Add(img);
doc.NewPage();
If you want to convert your image into PDF before and merge after you also can do that but you just have to use the PdfWriter to write them all first.

Create a new page in the opened PDF document and use the function for inserting images to place the image on that page.

How do you extract color profiles from a PDF file using pdfbox (or other open source Java lib)

Once you've loaded a document:
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
How do you get the page by page printing color intent from the PDDocument? I read the docs, didn't see coverage.

This gets the output intents (you'll get these with high quality PDF files) and also the icc profiles for colorspaces and images:
PDDocument doc = PDDocument.load(new File("XXXXX.pdf"));
for (PDOutputIntent oi : doc.getDocumentCatalog().getOutputIntents())
{
COSStream destOutputIntent = oi.getDestOutputIntent();
String info = oi.getOutputCondition();
if (info == null || info.isEmpty())
{
info = oi.getInfo();
}
InputStream is = destOutputIntent.createInputStream();
FileOutputStream fos = new FileOutputStream(info + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
for (int p = 0; p < doc.getNumberOfPages(); ++p)
{
PDPage page = doc.getPage(p);
for (COSName name : page.getResources().getColorSpaceNames())
{
PDColorSpace cs = page.getResources().getColorSpace(name);
if (cs instanceof PDICCBased)
{
PDICCBased iccCS = (PDICCBased) cs;
InputStream is = iccCS.getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
for (COSName name : page.getResources().getXObjectNames())
{
PDXObject x = page.getResources().getXObject(name);
if (x instanceof PDImageXObject)
{
PDImageXObject img = (PDImageXObject) x;
if (img.getColorSpace() instanceof PDICCBased)
{
InputStream is = ((PDICCBased) img.getColorSpace()).getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
}
}
doc.close();
What this doesn't do (but I could add some of it if needed):
colorspaces of shadings, patterns, xobject forms, appearance stream resources
recursion in colorspaces like DeviceN and Separation
recursion in patterns, xobject forms, soft masks

I read the examples on "How to create/add Intents to a PDF file". I couldn't get an example on "How to get intents". Using the API/examples, I wrote the following (untested code) to get the COSStream object for each of the Intents. See if this is useful for you.
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
for (PDOutputIntent e : list) {
p("PDOutputIntent Found:");
p("Info="+e.getInfo());
p("OutputCondition="+e.getOutputCondition());
p("OutputConditionIdentifier="+e.getOutputConditionIdentifier());
p("RegistryName="+e.getRegistryName());
COSStream cstr = e.getDestOutputIntent();
}
static void p(String s) {
System.out.println(s);
}
}

Using itext pdf library (fork of an older version 4.2.1) you could do smth. like:
PdfReader reader = new com.lowagie.text.pdf.PdfReader(Path pathToPdf);
PRStream stream = (PRStream) reader.getCatalog().getAsDict(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
For extracting the profile from each page you could iterate over each page like
for(int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
PRStream prStream = (PRStream) pdfReader.getPageN(i).getDirectObject(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
}

I don't know whether this code help or not, after searching below links,
How do I add an ICC to an existing PDF document
PdfBox - PDColorSpaceFactory.createColorSpace(document, iccColorSpace) throws nullpointerexception
https://pdfbox.apache.org/docs/1.8.11/javadocs/org/apache/pdfbox/pdmodel/graphics/color/PDICCBased.html
I found some code, check whether it help or not,
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
PDDocumentCatalog cat = doc.getDocumentCatalog();
COSArray cosArray = doc.getCOSObject();
PDICCBased pdCS = new PDICCBased( cosArray );
pdCS.getNumberOfComponents()
static void p(String s) {
System.out.println(s);
}
}

Get Image from the document using Apache POI

I am using Apache Poi to read images from docx.
Here is my code:
enter code here
public Image ReadImg(int imageid) throws IOException {
XWPFDocument doc = new XWPFDocument(new FileInputStream("import.docx"));
BufferedImage jpg = null;
List<XWPFPictureData> pic = doc.getAllPictures();
XWPFPictureData pict = pic.get(imageid);
String extract = pict.suggestFileExtension();
byte[] data = pict.getData();
//try to read image data using javax.imageio.* (JDK 1.4+)
jpg = ImageIO.read(new ByteArrayInputStream(data));
return jpg;
}
It reads images properly but not in order wise.
For example, if document contains
image1.jpeg
image2.jpeg
image3.jpeg
image4.jpeg
image5.jpeg
It reads
image4
image3
image1
image5
image2
Could you please help me to resolve it?
I want to read the images order wise.
Thanks,
Sithik

public static void extractImages(XWPFDocument docx) {
try {
List<XWPFPictureData> piclist = docx.getAllPictures();
// traverse through the list and write each image to a file
Iterator<XWPFPictureData> iterator = piclist.iterator();
int i = 0;
while (iterator.hasNext()) {
XWPFPictureData pic = iterator.next();
byte[] bytepic = pic.getData();
BufferedImage imag = ImageIO.read(new ByteArrayInputStream(bytepic));
ImageIO.write(imag, "jpg", new File("D:/imagefromword/" + pic.getFileName()));
i++;
}
} catch (Exception e) {
System.exit(-1);
}
}

How to read Images in the .doc file

How to read images in the ms-office .doc file using Apache poi? I have tried with the following code but it is not working.
try {
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\ImageDocument.doc"));
Document document = new Document();
OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/ImageDocumentPDF.pdf"));
PdfWriter.getInstance(document, fileOutput);
document.open();
HWPFDocument hdocument=new HWPFDocument(fs);
Range range=hdocument.getOverallRange();
PdfPTable createTable;
CharacterRun run;
PicturesTable picture=hdocument.getPicturesTable();
int picoffset=run.getPicOffset();
for(int i=0;i<range.numParagraphs();i++) {
run =range.getCharacterRun(i);
if(picture.hasPicture(run)) {
Picture pic=picture.extractPicture(run, true);
byte[] picturearray=pic.getContent();
com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
document.add(image);
}
}
}
When i execute the above code and prints the picture offset value it displays -1
and when print picture.hasPicture(run) it returns false though the input file has an image.
Please help me to find the solution.
Thank you

public static List<byte[]> extractImagesFromWord(File file) {
if (file.exists()) {
try {
List<byte[]> result = new ArrayList<byte[]>();
if ("docx".equals(getMimeType(file).getExtension())) {
org.apache.poi.xwpf.usermodel.XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
for (org.apache.poi.xwpf.usermodel.XWPFPictureData picture : doc.getAllPictures()) {
result.add(picture.getData());
}
} else if ("doc".equals(getMimeType(file).getExtension())) {
org.apache.poi.hwpf.HWPFDocument doc = new HWPFDocument(new FileInputStream(file));
for (org.apache.poi.hwpf.usermodel.Picture picture : doc.getPicturesTable().getAllPictures()) {
result.add(picture.getContent());
}
}
return result;
} catch (Exception e) {
throw new RuntimeException( e);
}
}
return null;
}

it worked for me, if picOffset returns -1, it means there is no image for current CharacterRun

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Retrieving content of hyperlinked slides in powerpoint files(.PPTX) through apache POI - java

Related

Convert PDF to JPG2000 file(s)

How can I merge the documents consisting PDFs as well as images?

How do you extract color profiles from a PDF file using pdfbox (or other open source Java lib)

Get Image from the document using Apache POI

How to read Images in the .doc file

Categories

Resources