JAVA: Extract Footer Images from a docx document - java

I've a task of extracting all the images from a docx file. I am ussing the snippet below for the same. I am using the Apache POI api for the same.
`File file = new File(InputFileString);
FileInputStream fs = new FileInputStream(file.getAbsolutePath());
//FileInputStream fs=new FileInputStream(src);
//create office word 2007+ document object to wrap the word file
XWPFDocument doc1x=new XWPFDocument(fs);
//get all images from the document and store them in the list piclist
List<XWPFPictureData> piclist=doc1x.getAllPictures();
//traverse through the list and write each image to a file
Iterator<XWPFPictureData> iterator=piclist.iterator();
int i=0;
while(iterator.hasNext()){
XWPFPictureData pic=iterator.next();
byte[] bytepic=pic.getData();
BufferedImage imag=ImageIO.read(new ByteArrayInputStream(bytepic));
ImageIO.write(imag, "jpg", new File("C:/imagefromword"+i+".jpg"));
i++;
}`
However, this code cannot detect any images which are in the footer or header section of the document.
I've extensively used my google skills and couldn't come up with anything useful.
Is there anyway to capture the image file in the footer section of the
docx file?

I am no expert on Apache POI issues, but a simple search came up with this code:
package com.concretepage;
import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;
public class ReadDOCXHeaderFooter {
public static void main(String[] args) {
try {
FileInputStream fis = new FileInputStream("D:/docx/read-test.docx");
XWPFDocument xdoc=new XWPFDocument(OPCPackage.open(fis));
XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);
//read header
XWPFHeader header = policy.getDefaultHeader();
System.out.println(header.getText());
//read footer
XWPFFooter footer = policy.getDefaultFooter();
System.out.println(footer.getText());
} catch(Exception ex) {
ex.printStackTrace();
}
}
}
And the documentation page of XWPFHeaderFooter (which is the direct father class of the XWPFFooter class in the above example...) shows the same getAllPictures method you used to iterate over all the pictures in the documents body.
I on mobile, so I haven't really tested anything - but it seems straight-forward enough to work.
Good luck!

Related

Which Java Library can help to convert an MS Word Document containing shapes and Word drawings to PDF?

I am trying to convert files with .docx extension to PDF using Java. I need to convert files with shapes and drawings in MS Word. Which libraries(open source or licensed) will serve the purpose?
Currently I have been using "org.apache.poi.xwpf.converter.pdf.PdfConverter" for the purpose, but it skips to convert the shapes or drawings in my Word Document. I am unable to test it using Aspose.words. Any help with that will also be appreciated.
The method I used for conversion is:
public static void createPDFFromIMG(String sSourceFilePath,String sFileName, String sDestinationFilePath) throws Exception {
logger.debug("Entered into createPDFFromIMG()\n");
logger.info("### Started PDF Conversion..");
System.out.println("### Started PDF Conversion..");
try {
if(sFileName.contains(".docx")) {
InputStream doc = new FileInputStream(new File(sSourceFilePath));
XWPFDocument document = new XWPFDocument(doc);
PdfOptions options = PdfOptions.create();
OutputStream out = new FileOutputStream(
new File(sDestinationFilePath + "/" + sFileName.split("\\.")[0] + ".pdf"));
PdfConverter.getInstance().convert(document, out, options);
doc.close();
out.close();
System.out.println("### Completed PDF Conversion..");
logger.info("### Completed PDF Conversion..");
logger.debug("Exited from createPDFFromIMG()");
return;
}
}
I expect the complete Word file to be converted to PDF, but the file converted using the mentioned Java library does not contain drawings or shapes present in the docx file.
It is not actually clear why you unable to test with Aspose.Words. Code is quite simple
Document doc = new Document("in.docx");
Doc.save("out.pdf");
Also you can test with free Aspose App (which is actually based on Aspose.Words)
https://products.aspose.app/words/conversion

Using POI To read/write a doc with the full POIFSFileSystem

I have the following issue, as everybody it seems, I want to replace some items with others in Word doc.
Issue with the issue is, the doc contains headers and footers which are part of the POIFSFileSystem (I know this because reading the FS / writing the doc back -without any changes- loses these informations, whereas reading the FS / writing it back as a new file doesn't).
Currently I do this :
POIFSFileSystem pfs = new POIFSFileSystem(fis);
HWPFDocument document = new HWPFDocument(pfs);
Range r1 = document.getRange();
…
document.write();
ByteArrayOutputStream bos = new ByteArrayOutputStream(50000);
pfs.writeFilesystem(bos);
pfs.close();
However this fails, with this error:
Opened read-only or via an InputStream, a Writeable File is required
If I don't rewrite the document, it works fine, but my changes are lost.
The other way around if I only save the document, not the filesystem, I lose the header/footer.
Now the problem is, how can I update the document while "saving as" the entire filesystem, or is there a way to force the document to contain everything from the file system?
The HWPF stuff is always in scratchpad because the DOC binary file format is the most horrible of all the Horrible formats. So it will really not be ready and also will be buggy in many cases.
But in your special case, your observations are not reproducible. Using apache poi 4.0.1 the HWPFDocument contains the header story, which also contains the footer stories, after creating from *.doc file. So the following works for me:
Source:
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.usermodel.*;
public class ReadAndWriteDOCWithHeaderFooter {
public static void main(String[] args) throws Exception {
HWPFDocument document = new HWPFDocument(new FileInputStream("TemplateDOCWithHeaderFooter.doc"));
Range bodyRange = document.getRange();
System.out.println(bodyRange);
for (int p = 0; p < bodyRange.numParagraphs(); p++) {
System.out.println(bodyRange.getParagraph(p).text());
if (bodyRange.getParagraph(p).text().contains("<<NAME>>"))
bodyRange.getParagraph(p).replaceText("<<NAME>>", "Axel Richter");
if (bodyRange.getParagraph(p).text().contains("<<DATE>>"))
bodyRange.getParagraph(p).replaceText("<<DATE>>", "12/21/1964");
if (bodyRange.getParagraph(p).text().contains("<<AMOUNT>>"))
bodyRange.getParagraph(p).replaceText("<<AMOUNT>>", "1,234.56");
System.out.println(bodyRange.getParagraph(p).text());
}
System.out.println("==============================================================================");
Range overallRange = document.getOverallRange();
System.out.println(overallRange);
for (int p = 0; p < overallRange.numParagraphs(); p++) {
System.out.println(overallRange.getParagraph(p).text()); // contains all inclusive header and footer
}
FileOutputStream out = new FileOutputStream("ResultDOCWithHeaderFooter.doc");
document.write(out);
out.close();
document.close();
}
}
Result:
So please do checking it again and tell us exactly what is not working for you. Because we need reproducing that, please do providing a minimal, complete, and verifiable example as I have done with my code.

How to get pictures with names from an xls file using Apache POI

Using workbook.getAllPictures() I can get an array of picture data but unfortunately it is only the data and those objects have no methods for accessing the name of the picture or any other related information.
There is a HSSFPicture class which would contain all the details of the picture but how to get for example an array of those objects from the xls?
Update:
Found SO question How can I find a cell, which contain a picture in apache poi which has a method for looping through all the pictures in the worksheet. That works.
Now that I was able to try the HSSFPicture class I found out that the getFileName() method is returning the file name without the extension. I can use the getPictureData().suggestFileExtension() to get a suggested file extension but I really would need to get the extension the picture had when it was added into the xls file. Would there be a way to get it?
Update 2:
The pictures are added into the xls with a macro. This is the part of macro that is adding the images into the sheet. fname is the full path and imageName is the file name, both are including the extension.
Set img = Sheets("Receipt images").Pictures.Insert(fname)
img.Left = 10
img.top = top + 10
img.Name = imageName
Set img = Nothing
The routine to check if the picture already exists in the Excel file.
For Each img In Sheets("Receipt images").Shapes
If img.Name = imageName Then
Set foundImage = img
Exit For
End If
Next
This recognizes that "image.jpg" is different from "image.gif", so the img.Name includes the extension.
The shape names are not in the default POI objects. So if we need them we have to deal with the underlying objects. That is for the shapes in HSSF mainly the EscherAggregate (http://poi.apache.org/apidocs/org/apache/poi/hssf/record/EscherAggregate.html) which we can get from the sheet. From its parent class AbstractEscherHolderRecord we can get all EscherOptRecords which contains the options of the shapes. In those options are also to find the groupshape.shapenames.
My example is not the complete solution. It is only provided to show which objects could be used to achieve this.
Example:
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.ss.usermodel.*;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.hssf.record.*;
import org.apache.poi.ddf.*;
import java.util.List;
import java.util.ArrayList;
class ShapeNameTestHSSF {
public static void main(String[] args) {
try {
InputStream inp = new FileInputStream("workbook1.xls");
Workbook wb = WorkbookFactory.create(inp);
Sheet sheet = wb.getSheetAt(0);
EscherAggregate escherAggregate = ((HSSFSheet)sheet).getDrawingEscherAggregate();
EscherContainerRecord escherContainer = escherAggregate.getEscherContainer().getChildContainers().get(0);
//throws java.lang.NullPointerException if no Container present
List<EscherRecord> escherOptRecords = new ArrayList<EscherRecord>();
escherContainer.getRecordsById(EscherOptRecord.RECORD_ID, escherOptRecords);
for (EscherRecord escherOptRecord : escherOptRecords) {
for (EscherProperty escherProperty : ((EscherOptRecord)escherOptRecord).getEscherProperties()) {
System.out.println(escherProperty.getName());
if (escherProperty.isComplex()) {
System.out.println(new String(((EscherComplexProperty)escherProperty).getComplexData(), "UTF-16LE"));
} else {
if (escherProperty.isBlipId()) System.out.print("BlipId = ImageId = ");
System.out.println(((EscherSimpleProperty)escherProperty).getPropertyValue());
}
System.out.println("=============================");
}
System.out.println(":::::::::::::::::::::::::::::");
}
FileOutputStream fileOut = new FileOutputStream("workbook1.xls");
wb.write(fileOut);
fileOut.flush();
fileOut.close();
} catch (InvalidFormatException ifex) {
} catch (FileNotFoundException fnfex) {
} catch (IOException ioex) {
}
}
}
Again: This is not a ready to use solution. A ready to use solution cannot be provided here, because of the complexity of the EscherRecords. Maybe to get the correct EscherRecords for the image shapes and their related EscherOptRecords, you have recursive to loop through all EscherRecords in the EscherAggregate checking whether they are ContainerRecords and if so loop through its children and so on.
Start here:
http://poi.apache.org/spreadsheet/quick-guide.html#Images
this tutorial can help you to extract an image's information from an xls spreadsheet using Apache POI

Cannot convert Gujarati PDF(unicode) to text

I am trying to read a PDF file of Gujarat electoral roll (sample file). I need to extract all the information in a structured format. I am using pdfbox from Apache to extract text from the PDF file.
The problem that I am facing is that certain characters are getting lost in the conversion and there is a lot of noise in the converted text. Please find the converted file here.
The code
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;
public class Main {
public static void main(String[] args){
PDDocument pd;
BufferedWriter wr;
try {
File input = new File("myPDF_manual.pdf");
File output = new File("newPaperTestFile.txt"); // The text file where you are going to store the extracted data
pd = PDDocument.load(input);
PDFTextStripper stripper = new PDFTextStripper();
wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
stripper.writeText(pd, wr);
if (pd != null) {
pd.close();
wr.close();
System.out.println(" file processed.");
}
} catch (Exception e){
e.printStackTrace();
}
}
}
I also tried the code using getText() method of PDFTextStripper class but the result is same.
I also tried to convert the pdf to xml using pdftohtml command line utility for linux. But there also some of the information is still lost. The xml file can be found here
Please suggest me any solution to solve this problem. Solution doesn't need to be Java specific.

Creating tables in a MS Word file using Java

I want to create a table in a Microsoft Office Word file using Java. Can anybody tell me how to do it with an example?
Have a look at Apache POI
The POI project is the master project
for developing pure Java ports of file
formats based on Microsoft's OLE 2
Compound Document Format. OLE 2
Compound Document Format is used by
Microsoft Office Documents, as well as
by programs using MFC property sets to
serialize their document objects.
I've never seen it done, and I work in Word a lot. If you really want to programatically do something in a word document then I'd advise using Microsoft's scripting language VBA which is specifically designed for this purpose. In fact, I'm working in it right now.
If you're working under Open Office then they have a very similar set of macro-powered tools for doing the same thing.
Office 2003 has an xml format, and the default document format for office 2007 is xml (zipped). So you could just generate xml from java. If you open an existing document it's not too hard too see the xml required.
Alternatively, you could use openoffice's api to generate a document, and save it as a ms-word document.
This snippet can be used to create a table dynamically in MS Word document.
WPFDocument document = new XWPFDocument();
XWPFTable tableTwo = document.createTable();
XWPFTableRow tableTwoRowOne = tableTwo.getRow(0);
tableTwoRowOne.getCell(0).setText(Knode1);
tableTwoRowOne.createCell().setText(tags.get("node1").toString());
for (int i = 1; i < nodeList.length; i++) {
String node = "node";
String nodeVal = "";
XWPFTableRow tr = null;
node = node + (i + 1);
nodeVal = tags.get(node).toString();
if (tr == null) {
tr = tableTwo.createRow();
tr.getCell(0).setText(nodeList[i]);
tr.getCell(1).setText(tags.get(node).toString());
}
}
Our feature set is to hit a button in our web app and get the page you are looking at back as a Word document. We use the docx schema for description of documents and have a bunch of Java code on the server side which does the document creation and response back to our web client. The formatting itself is done with some compiled xsl-t's from within Java to translate from our own XML persistence tier.
The docx schema is pretty hard to understand. The way we made most progress was to create template docx's in Word with exactly the formatting that we needed but with bogus content. We then fooled around with them until we understood exactly what was going on. There is a huge amount in the docx that you don't really need to worry about. When reading / translating the docx Word is pretty tolerant to a partially complete formatting schema. In fact we chose to strip out pretty much all the formatting because it also means that the user's default formatting takes precedence, which they seem to prefer. It also makes the xsl process faster and the resulting document smaller.
I manage the docx4j project
docx4j contains a class TblFactory, which creates regular tables (ie no row or column spans), with the default settings which Word 2007 would create, and with the dimensions specified by the user.
If you want a more complex table, the easiest approach is to create it in Word, then copy the resulting XML into a String in your IDE, where you can use docx4j's XmlUtils.unmarshalString to create a Tbl object from it.
Using my little zip utility, you can create docx with ease, if you know what you're doing. Word's DOCX file format is simply zip (folders with xml files). By using java zip utilities, you can modify existing docx, just the content part.
For the following sample to work, simply open Word, enter few lines, save document. Then with zip program, remove file word/document.xml (this is file where main content of the Word document is residing) from the zip. Now you have the template prepared. Save modified zip.
Here is what creation of new Word file looks:
/* docx file head */
final String DOCUMENT_XML_HEAD =
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?>" +
"<w:document xmlns:wpc=\"http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas\" xmlns:mc=\"http://schemas.openxmlformats.org/markup-compatibility/2006\" xmlns:o=\"urn:schemas-microsoft-com:office:office\" xmlns:r=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships\" xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\" xmlns:v=\"urn:schemas-microsoft-com:vml\" xmlns:wp14=\"http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing\" xmlns:wp=\"http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing\" xmlns:w10=\"urn:schemas-microsoft-com:office:word\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" xmlns:w14=\"http://schemas.microsoft.com/office/word/2010/wordml\" xmlns:w15=\"http://schemas.microsoft.com/office/word/2012/wordml\" xmlns:wpg=\"http://schemas.microsoft.com/office/word/2010/wordprocessingGroup\" xmlns:wpi=\"http://schemas.microsoft.com/office/word/2010/wordprocessingInk\" xmlns:wne=\"http://schemas.microsoft.com/office/word/2006/wordml\" xmlns:wps=\"http://schemas.microsoft.com/office/word/2010/wordprocessingShape\" mc:Ignorable=\"w14 w15 wp14\">" +
"<w:body>";
/* docx file foot */
final String DOCUMENT_XML_FOOT =
"</w:body>" +
"</w:document>";
final ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("c:\\TEMP\\test.docx"));
final String fullDocumentXmlContent = DOCUMENT_XML_HEAD + "<w:p><w:r><w:t>Hey MS Word, hello from java.</w:t></w:r></w:p>" + DOCUMENT_XML_FOOT;
final si.gustinmi.DocxZipCreator creator = new si.gustinmi.DocxZipCreator();
// create new docx file
creator.createDocxFromExistingDocx(zos, "c:\\TEMP\\existingDocx.docx", fullDocumentXmlContent);
These are zip utilities:
package si.gustinmi;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
/**
* Creates new docx from existing one.
* #author gustinmi [at] gmail [dot] com
*/
public class DocxZipCreator {
public static final Logger log = Logger.getLogger(DocxZipCreator.class.getCanonicalName());
private static final int BUFFER_SIZE = 4096;
/** OnTheFly zip creator. Traverses through existing docx zip and creates new one simultaneousl.
* On the end, custom document.xml is inserted inside
* #param zipFilePath location of existing docx template (without word/document.xml)
* #param documentXmlContent content of the word/document.xml
* #throws IOException
*/
public void createDocxFromExistingDocx(ZipOutputStream zos, String zipFilePath, String documentXmlContent) throws IOException {
final FileInputStream fis = new FileInputStream(zipFilePath);
final ZipInputStream zipIn = new ZipInputStream(fis);
try{
log.info("Starting to create new docx zip");
ZipEntry entry = zipIn.getNextEntry();
while (entry != null) { // iterates over entries in the zip file
copyEntryfromZipToZip(zipIn, zos, entry.getName());
zipIn.closeEntry();
entry = zipIn.getNextEntry();
}
// add document.xml to existing zip
addZipEntry(documentXmlContent, zos, "word/document.xml");
}finally{
zipIn.close();
zos.close();
log.info("End of docx creation");
}
}
/** Copies sin gle entry from zip to zip */
public void copyEntryfromZipToZip(ZipInputStream is, ZipOutputStream zos, String entryName)
{
final byte [] data = new byte[BUFFER_SIZE];
int len;
int lenTotal = 0;
try {
final ZipEntry entry = new ZipEntry(entryName);
zos.putNextEntry(entry);
final CRC32 crc32 = new CRC32();
while ((len = is.read(data)) > -1){
zos.write(data, 0, len);
crc32.update(data, 0, len);
lenTotal += len;
}
entry.setSize(lenTotal);
entry.setTime(System.currentTimeMillis());
entry.setCrc(crc32.getValue());
}
catch (IOException ioe){
ioe.printStackTrace();
}
finally{
try { zos.closeEntry();} catch (IOException e) {}
}
}
/** Create new zip entry with content
* #param content content of a new zip entry
* #param zos
* #param entryName name (npr: word/document.xml)
*/
public void addZipEntry(String content, ZipOutputStream zos, String entryName)
{
final byte [] data = new byte[BUFFER_SIZE];
int len;
int lenTotal = 0;
try {
final InputStream is = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
final ZipEntry entry = new ZipEntry(entryName);
zos.putNextEntry(entry);
final CRC32 crc32 = new CRC32();
while ((len = is.read(data)) > -1){
zos.write(data, 0, len);
crc32.update(data, 0, len);
lenTotal += len;
}
entry.setSize(lenTotal);
entry.setTime(System.currentTimeMillis());
entry.setCrc(crc32.getValue());
}
catch (IOException ioe){
ioe.printStackTrace();
}
finally{
try { zos.closeEntry();} catch (IOException e) {}
}
}
}
Office Writer would be a better tool to use than POI for your requirement.
If all you want is a simple table without too much of formatting, I would use this simple trick. Use Java to generate the table as HTML using plain old table,tr,td tags and copy the rendered HTML table into the word document ;)
Click here for a Working example with source code.
This example generates MS-Word docs from Java, based on a template concept.

Categories