How can I get footnotes and paragraphs from Apache POI?

I have code to get paragraphs from a .doc file in Apache POI, but I'd like to get footnotes also. Also, is this the only way to get paragraphs?
Code so far:
InputStream stream = ...
HWPFDocument document = new HWPFDocument(stream);
Range range = document.getRange();
StyleSheet stylesheet = document.getStyleSheet();
for (int i = 0; i < range.numParagraphs(); i++) {
    Paragraph paragraph = range.getParagraph(i);
    String text = paragraph.text();
}
Any ideas?

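You could try WordExtractor, which returns both paragraphs and footnotes as string arrays:
WordExtractor extractor = new WordExtractor(document);
// paragraphs and footnotes are assumed to be existing List<String> collections
paragraphs.addAll(Arrays.asList(extractor.getParagraphText()));
footnotes.addAll(Arrays.asList(extractor.getFootnoteText()));
extractor.close();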

Related

How to use iText to parse paths (such as lines in the document)

I am using iText to parse text in a PDF document, and I am using PdfContentStreamProcessor with a RenderListener. For example:
PdfReader reader = new PdfReader(file.toURI().toURL());
int numberOfPages = reader.getNumberOfPages();
MyRenderListener listener = new MyRenderListener();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++) {
    PdfDictionary pageDic = reader.getPageN(pageNumber);
    PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
    Rectangle pageSize = reader.getPageSize(pageNumber);
    listener.startPage(pageNumber, pageSize);
    processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNumber), resourcesDic);
}
I have no problem getting the text with the renderText(TextRenderInfo) method, but how do I parse the graphic content apart from images? For example, in my case I would like to get:
Text content which is in a box
Horizontal lines
Following mkl's comment, I was able to get the geometries by using ExtRenderListener. I used "How to extract the color of a rectangle in a PDF, with iText" for reference.
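Once the path coordinates have been collected by the listener, the two cases in the question (horizontal lines, text inside a box) come down to plain geometry. The following is a minimal sketch of that post-processing step only. It is not iText API: the class PathGeometry, its methods, and the tolerance value are hypothetical, and it assumes the segment end points and the text position are already expressed in the same coordinate system.

```java
/** Hypothetical helper; none of these names come from iText. */
final class PathGeometry {
    private PathGeometry() {}

    /**
     * True if the segment (x1,y1)-(x2,y2) stays within the tolerance
     * vertically and extends further than the tolerance along x.
     */
    static boolean isHorizontal(double x1, double y1, double x2, double y2, double tolerance) {
        return Math.abs(y1 - y2) <= tolerance && Math.abs(x1 - x2) > tolerance;
    }

    /**
     * True if the point (px,py) lies inside or on the border of the
     * axis-aligned box given by two opposite corners, in any order.
     */
    static boolean isInsideBox(double px, double py,
                               double bx1, double by1, double bx2, double by2) {
        double left = Math.min(bx1, bx2);
        double right = Math.max(bx1, bx2);
        double bottom = Math.min(by1, by2);
        double top = Math.max(by1, by2);
        return px >= left && px <= right && py >= bottom && py <= top;
    }

    public static void main(String[] args) {
        // A nearly flat segment, checked with a tolerance of half a unit
        System.out.println(isHorizontal(72, 500, 300, 500.2, 0.5));
        // A text start point checked against a box spanning (90,100)-(200,130)
        System.out.println(isInsideBox(100, 110, 90, 100, 200, 130));
    }
}
```

A tolerance is used because line coordinates in real PDFs are rarely exactly equal; the right value depends on the document and is a judgment call.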

Apache poi replace existing picture on header

Is there any way to replace an image in the header of a Word (docx) file, identified by the name of the image, with Apache POI? I'm thinking of something like this:
+--------------------------------+
+HEADER myimage.jpeg-+
+ -----------BODY------------+
+--------------------------------+
replaceImage("myimage.jpeg", newPictureInputStream,
"newPicture_name.jpeg");
Here is what I tried:
XWPFParagraph originalParagraph = null;
originalParagraph = getPictureParagraphInHead(lookingPictureName);
ListIterator<XWPFRun> it = originalParagraph.getRuns().listIterator();
XWPFRun replacedRun = null;
while (it.hasNext()) {
    XWPFRun run = it.next();
    int runIDX = it.nextIndex();
    if (run.getEmbeddedPictures().size() > 0) {
        XWPFRun newRun = null;
        newRun = new XWPFRun(run.getCTR(), (IRunBody) originalParagraph);
        originalParagraph.addRun(newRun);
        originalParagraph.removeRun(originalParagraph.getRuns().indexOf(run));
        break;
    }
}
I'm not sure if you can get the "filename" of the image with POI. It's probably in the XML so you might have to make your own method for finding the image.
To get the Header you do:
XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(doc); // XWPFDocument
XWPFHeader header = policy.getDefaultHeader();
And to delete the images, get the XWPFRun from your paragraph (cell/row/table..)
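CTR ctr = myRun.getCTR(); // underlying XML bean of the run
// Remove from the last index down; counting up would skip entries,
// because each removal shifts the remaining drawings by one.
for (int i = ctr.sizeOfDrawingArray() - 1; i >= 0; i--) {
    ctr.removeDrawing(i);
}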
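When entries are removed by index, each removal shifts the later entries down by one, so a loop that counts upwards skips every second entry. A minimal sketch with a plain ArrayList standing in for the drawing list shows the difference. The class and method names are illustrative only and are not part of POI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a plain list stands in for the run's drawing list. */
final class IndexedRemoval {
    private IndexedRemoval() {}

    /** Counts upwards; entries that slide into an already visited index survive. */
    static <T> void removeAllForward(List<T> items) {
        for (int i = 0; i < items.size(); i++) {
            items.remove(i);
        }
    }

    /** Counts downwards; indexes not yet visited are unaffected by each removal. */
    static <T> void removeAllBackward(List<T> items) {
        for (int i = items.size() - 1; i >= 0; i--) {
            items.remove(i);
        }
    }

    public static void main(String[] args) {
        List<String> forward = new ArrayList<>(Arrays.asList("d1", "d2", "d3"));
        removeAllForward(forward);
        System.out.println(forward); // prints [d2]

        List<String> backward = new ArrayList<>(Arrays.asList("d1", "d2", "d3"));
        removeAllBackward(backward);
        System.out.println(backward); // prints []
    }
}
```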

Unable to read entire cell of a .doc file using Apache POI

In one of my projects I need to read images from a .doc file using Apache POI. In each row there is a cell containing one or more images, which I need to read out alongside the text data.
So I tried the following code
FileInputStream fileInputStream = new FileInputStream(file);
POIFSFileSystem poifsFileSystem = new POIFSFileSystem(fileInputStream);
HWPFDocument doc = new HWPFDocument(poifsFileSystem);
Range range = doc.getRange();
PicturesTable pictureTable = doc.getPicturesTable();
PicturesSource pictures = new PicturesSource(doc);
Paragraph tableParagraph = range.getParagraph(0);
Table table = range.getTable(tableParagraph);
TableRow row = table.getRow(0);
TableCell cell1 = row.getCell(0);
for (int j = 0; j < cell1.getParagraph(0).numCharacterRuns(); j++) {
    CharacterRun cr = cell1.getParagraph(0).getCharacterRun(j);
    if (pictureTable.hasPicture(cr)) {
        logger.debug("Has picture If--");
        Picture picture = pictures.getFor(cr);
        logger.debug("pictures Description--" + picture.getDescription());
    }
}
Now I am able to read images from a particular cell, but not all of them: I can read an image that comes before the text and an image in between the text, but not an image that follows the text. Example: "image_1---some text---image_2 some text---.image_3". In this case only image_3 cannot be read. What should I do so that I can read image_3 as well? I have searched a lot but have had no luck so far. I hope someone knows how to do this. Thanks in advance.
With the HWPFDocument, I am having problems, too. If you have a chance to change the Word documents to docx before processing, here's an example that works with XWPFDocuments:
FileInputStream fileInputStream = new FileInputStream(file);
XWPFDocument doc = new XWPFDocument(fileInputStream);
for (XWPFTable tbl : doc.getTables()) {
    for (XWPFTableRow row : tbl.getRows()) {
        for (XWPFTableCell cell : row.getTableCells()) {
            for (XWPFParagraph para : cell.getParagraphs()) {
                for (XWPFRun run : para.getRuns()) {
                    for (XWPFPicture pic : run.getEmbeddedPictures()) {
                        System.out.println(pic.getPictureData());
                    }
                }
            }
        }
    }
}

Insert .pdf doc or .png image content into a .docx file using java

How can I insert pdf or png content into a docx file using java?
I've tried using Apache POI API in the following way, but it is not working (it generates some junk doc file):
XWPFDocument doc = new XWPFDocument();
String pdf = "D://capture1.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
    String text = strategy.getResultantText();
    XWPFParagraph p = doc.createParagraph();
    XWPFRun run = p.createRun();
    run.setText(text);
    run.addBreak(BreakType.PAGE);
}
FileOutputStream out1 = new FileOutputStream("D://javadomain1.docx");
doc.write(out1);
out1.close();
reader.close();
System.out.println("Document converted successfully");
You should be able to do it with POI, and you can certainly do it using docx4j.
Here's sample code for inserting an image using docx4j.
Note that to "insert a PDF", you need to OLE embed it. That's more difficult, since you need to convert the PDF to a suitable binary OLE object. In docx4j, helper code for doing this is part of the commercial Enterprise edition.

Find a table in word and write in that table using java

I have a Word document which may contain any number of tables. Each table is identified by its name, which is written in the first cell as a heading. I have to find the table with a given name and write into one of the cells of that table. I tried using Apache POI for this but could not figure out how to use it for my purpose. Please refer to the attached screenshot if my description of the document is unclear.
Thanks
String fileName = "E:\\a1.doc";
if (args.length > 0) {
    fileName = args[0];
}
InputStream fis = new FileInputStream(fileName);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    Paragraph tablePar = range.getParagraph(i);
    if (tablePar.isInTable()) {
        Table table = range.getTable(tablePar);
        for (int rowIdx = 0; rowIdx < table.numRows(); rowIdx++) {
            TableRow row = table.getRow(rowIdx); // current row
            for (int colIdx = 0; colIdx < row.numCells(); colIdx++) {
                TableCell cell = row.getCell(colIdx);
                System.out.println("column=" + cell.getParagraph(0).text());
            }
        }
    }
}
This is what I have tried, but it reads only the first table.
I think you have misunderstood how POI handles tables.
If you just want to read the tables, use TableIterator to fetch each table's content; otherwise you will get an exception saying that the paragraph is not at the start of a table.
The code below assumes there is only one paragraph in every table cell.
InputStream fis = new FileInputStream(fileName);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
TableIterator itr = new TableIterator(range);
while (itr.hasNext()) {
    Table table = itr.next();
    for (int rowIndex = 0; rowIndex < table.numRows(); rowIndex++) {
        TableRow row = table.getRow(rowIndex);
        for (int colIndex = 0; colIndex < row.numCells(); colIndex++) {
            TableCell cell = row.getCell(colIndex);
            System.out.println(cell.getParagraph(0).text());
        }
    }
}
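To pick a table by the name in its first cell, the cell text has to be compared with the name being searched for. In HWPF the text of a cell paragraph normally ends with control characters (the cell mark U+0007, sometimes a carriage return), so an exact equals on the raw text tends to fail. Below is a minimal sketch of that comparison, using plain strings so that it runs without POI. TableNameMatcher and its methods are hypothetical names, and the input list is assumed to hold the result of table.getRow(0).getCell(0).getParagraph(0).text() for each table, in iteration order.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; works on raw cell texts already read from the document. */
final class TableNameMatcher {
    private TableNameMatcher() {}

    /** Drops trailing control characters (such as the cell mark) and surrounding whitespace. */
    static String normalize(String raw) {
        int end = raw.length();
        while (end > 0 && Character.isISOControl(raw.charAt(end - 1))) {
            end--;
        }
        return raw.substring(0, end).trim();
    }

    /** Returns the index of the first table whose heading matches, ignoring case, or -1. */
    static int indexOfTable(List<String> firstCellTexts, String tableName) {
        String wanted = normalize(tableName);
        for (int i = 0; i < firstCellTexts.size(); i++) {
            if (normalize(firstCellTexts.get(i)).equalsIgnoreCase(wanted)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample headings with the trailing characters HWPF typically leaves in place
        List<String> headings = Arrays.asList("Customers\u0007", "Orders\r\u0007");
        System.out.println(indexOfTable(headings, "orders")); // prints 1
    }
}
```

The returned index can then be used to stop the TableIterator loop at the matching table. The case-insensitive match is a design choice for this sketch, not a requirement.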
I think Apache POI is the way to go. It's not well documented, but the time spent researching how to use it may be worth it. A Word document is basically a hierarchical (tree) structure which you need to traverse to find the data you need.
