Read table from docx file using Apache POI

I am able to read tables from a .doc file (see the following code):
public String readDocFile(String filename, String str) {
try {
InputStream fis = new FileInputStream(filename);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
boolean intable = false;
boolean inrow = false;
for (int i = 0; i < range.numParagraphs(); i++) {
Paragraph par = range.getParagraph(i);
//System.out.println("paragraph "+(i+1));
//System.out.println("is in table: "+par.isInTable());
//System.out.println("is table row end: "+par.isTableRowEnd());
//System.out.println(par.text());
if (par.isInTable()) {
if (!intable) {//System.out.println("New table creating"+intable);
str += "<table border='1'>";
intable = true;
}
if (!inrow) {//System.out.println("New row creating"+inrow);
str += "<tr>";
inrow = true;
}
if (par.isTableRowEnd()) {
inrow = false;
} else {
//System.out.println("New text adding"+par.text());
str += "<td>" + par.text() + "</td>";
}
} else {
if (inrow) {//System.out.println("Closing Row");
str += "</tr>";
inrow = false;
}
if (intable) {//System.out.println("Closing Table");
str += "</table>";
intable = false;
}
str += par.text() + "<br/>";
}
}
} catch (Exception e) {
System.out.println("Exception: " + e);
}
return str;
}
Can anyone suggest how I can do the same with a docx file?
I tried, but could not find a replacement for the 'Range' class.
Please help.

By popular request, promoting a comment to an answer...
In the Apache POI code examples, you can find the XWPF SimpleTable example.
It shows how to create a simple table, and how to create one with lots of fancy styling.
Assuming you just want a simple table from scratch, in a brand new document, the code you need goes along the lines of:
// Start with a new document
XWPFDocument doc = new XWPFDocument();
// Add a 3 column, 3 row table
XWPFTable table = doc.createTable(3, 3);
// Set some text in the middle
table.getRow(1).getCell(1).setText("EXAMPLE OF TABLE");
// table cells have a list of paragraphs; there is an initial
// paragraph created when the cell is created. If you create a
// paragraph in the document to put in the cell, it will also
// appear in the document following the table, which is probably
// not the desired result.
XWPFParagraph p1 = table.getRow(0).getCell(0).getParagraphs().get(0);
XWPFRun r1 = p1.createRun();
r1.setBold(true);
r1.setText("The quick brown fox");
r1.setItalic(true);
r1.setFontFamily("Courier");
r1.setUnderline(UnderlinePatterns.DOT_DOT_DASH);
r1.setTextPosition(100);
// And at the end
table.getRow(2).getCell(2).setText("only text");
// Save it out, to view in word
FileOutputStream out = new FileOutputStream("simpleTable.docx");
doc.write(out);
out.close();

The following snippet uses Apache POI 5.0.0 and works well for reading docx table data:
public void readDocxTables(String docxFilePath) throws IOException {
    // try-with-resources closes both the input stream and the document
    try (FileInputStream in = new FileInputStream(docxFilePath);
         XWPFDocument doc = new XWPFDocument(in)) {
        for (XWPFTable table : doc.getTables()) {
            for (XWPFTableRow row : table.getRows()) {
                for (XWPFTableCell cell : row.getTableCells()) {
                    System.out.println("cell text: " + cell.getText());
                }
            }
        }
    }
}
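The original question builds HTML markup from the table content. Here is a minimal sketch of that step, assuming the cell strings have already been collected (for example with `cell.getText()`). `TableHtml`, `escape` and `toHtml` are hypothetical names used for illustration; they are not part of POI.

```java
import java.util.Arrays;
import java.util.List;

public class TableHtml {
    // Escapes the characters that are significant in HTML text content.
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds one <table> element from rows of plain cell text.
    public static String toHtml(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("<table border='1'>");
        for (List<String> row : rows) {
            sb.append("<tr>");
            for (String cell : row) {
                sb.append("<td>").append(escape(cell)).append("</td>");
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }

    public static void main(String[] args) {
        // Sample data standing in for text read from a docx table
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "Value"),
                Arrays.asList("a < b", "c & d"));
        System.out.println(toHtml(rows));
    }
}
```

Using a `StringBuilder` avoids the repeated string concatenation of the question's code, and escaping keeps cell text such as `a < b` from breaking the markup.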

This is not Apache POI, but I found it much easier using a third-party component.
An example of how to get tables from a docx file.
Of course, this is just an idea in case you do not find a solution with POI.

Related

How to add a paragraph or text between Tables in .docx file with apache POI

I'm trying to insert paragraphs or text into a .docx file using Apache POI. I read a .docx file used as a template from the WEB-INF/resources/templates folder inside my WAR file. Once it is read, I want to dynamically create more tables after the 9th table, which serves as the template. I am able to add more tables, but other kinds of content (paragraphs) end up in a different part of the document. Is there a required way to do this?
XWPFDocument doc = null;
try {
doc = new XWPFDocument(OPCPackage.open(request.getSession().getServletContext().getResourceAsStream("/resources/templates/twd.docx")));
} catch (Exception e) {
e.printStackTrace();
}
XWPFParagraph parrafo = null;
XWPFTable table=null;
org.apache.xmlbeans.XmlCursor cursor = null;
XWPFParagraph newParagraph = null;
XWPFRun run = null;
for(int j=0; j < 3; j++) { //create 3 more tables
table = doc.getTables().get(9);
cursor = table.getCTTbl().newCursor();
cursor.toEndToken();
if (cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START);
{
table = doc.insertNewTbl(cursor);
table.getRow(0).getCell(0).addParagraph().createRun()
.setText("Name");
table.createRow().getCell(0).addParagraph().createRun().setText("Version");
table.createRow().getCell(0).addParagraph().createRun().setText("Description");
table.createRow().getCell(0).addParagraph().createRun().setText("Comments");
table.createRow().getCell(0).addParagraph().createRun().addCarriageReturn();
table.getRow(0).createCell().addParagraph().createRun().setText("some text");
table.getRow(1).createCell().addParagraph().createRun().setText("some text");
table.getRow(2).createCell().addParagraph().createRun().setText("some text");
table.getRow(3).createCell().addParagraph().createRun().setText("some text");
table.getRows().get(0).getCell(0).setColor("183154");
table.getRows().get(1).getCell(0).setColor("183154");
table.getRows().get(2).getCell(0).setColor("183154");
table.getRows().get(3).getCell(0).setColor("183154");
table.getCTTbl().addNewTblGrid().addNewGridCol().setW(BigInteger.valueOf(4000));
table.getCTTbl().getTblGrid().addNewGridCol().setW(BigInteger.valueOf(4000));
}
//OTHER CONTENT BETWEEN CREATED TABLES (PARAGRAPHS, BREAK LINES,ETC)
doc.createParagraph().createRun().setText("text after table");
}

Once you use a cursor, you must keep using that cursor for inserting further content if you want that content to land in the part of the document where the cursor is. Do not assume the document automatically keeps track of the cursor you created.
So for example:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordTextAfterTable {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("WordTextAfterTable.docx"));
XWPFTable table = document.getTables().get(9);
org.apache.xmlbeans.XmlCursor cursor = table.getCTTbl().newCursor();
cursor.toEndToken(); //now we are at end of the CTTbl
//there always must be a next start token after the table. Either a p or at least sectPr.
while(cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START); //we loop over the tokens until next TokenType.START
//now we are at next TokenType.START and insert the new table
//note: This is immediately after the table. So both tables touch each other.
table = document.insertNewTbl(cursor);
table.getRow(0).getCell(0).addParagraph().createRun().setText("Name");
table.createRow().getCell(0).addParagraph().createRun().setText("Version");
table.createRow().getCell(0).addParagraph().createRun().setText("Description");
table.createRow().getCell(0).addParagraph().createRun().setText("Comments");
table.createRow().getCell(0).addParagraph().createRun().addCarriageReturn();
//...
System.out.println(cursor.isEnd()); //cursor is now at the end of the new table
//there always must be a next start token after the table. Either a p or at least sectPr.
while(cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START); //we loop over the tokens until next TokenType.START
XWPFParagraph newParagraph = document.insertNewParagraph(cursor);
XWPFRun run = newParagraph.createRun();
run.setText("text after table");
document.write(new FileOutputStream("WordTextAfterTableNew.docx"));
document.close();
}
}

Read UTF-8 encoded text content inside table cell in MS-word file using Apache POI

I'm trying to read a table and extract data from a Microsoft Word document (docx file) using Apache POI. The file contains UTF-8 encoded characters (Sinhala language). I'm using the following code block.
FileInputStream fis = new FileInputStream("path\\to\\file.docx");
XWPFDocument doc = new XWPFDocument(fis);
Iterator<IBodyElement> iter = doc.getBodyElementsIterator();
while (iter.hasNext()) {
IBodyElement elem = iter.next();
if (elem instanceof XWPFTable) {
List<XWPFTableRow> rows = ((XWPFTable) elem).getRows();
for(XWPFTableRow row :rows){
List<XWPFTableCell> cells = row.getTableCells();
for(XWPFTableCell cell : cells){
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(cell.getText());
}
}
}
}
But I'm not getting the correct UTF-8 characters in the console output.
I have already looked at several solutions, including the following:
How to parse UTF-8 characters in Excel files using POI | I'm trying to read a table in a Word file, so my cell object doesn't have a getStringCellValue() method.
http://www.herongyang.com/Java-Tools/native2ascii-Set-UTF-8-Encoding-in-PrintStream.html | I have already tried this solution and it's not working.
Does anyone know how to read UTF-8 encoded characters in a Word file using Apache POI?

I found a solution by setting the font for a cell (as a paragraph).
Code:
private static final String FILE_NAME = "/tmp/Diskade.docx";
public static void main(String[] args) throws IOException {
FileInputStream fis = new FileInputStream(FILE_NAME);
XWPFDocument doc = new XWPFDocument(fis);
Iterator<IBodyElement> iter = doc.getBodyElementsIterator();
while (iter.hasNext()) {
IBodyElement elem = iter.next();
if (elem instanceof XWPFTable) {
List<XWPFTableRow> rows = ((XWPFTable) elem).getRows();
for(XWPFTableRow row :rows){
List<XWPFTableCell> cells = row.getTableCells();
for(XWPFTableCell cell : cells){
String celltext = cell.getText();
XWPFParagraph paragraph = cell.addParagraph();
setRun(paragraph.createRun() , "Arial" , 10, "2b5079" , celltext , false, false);
System.out.print(cell.getParagraphs().get(0).getParagraphText() + " - ");
}
System.out.println();
}
}
}
}
private static void setRun (XWPFRun run , String fontFamily , int fontSize , String colorRGB , String text , boolean bold , boolean addBreak) {
run.setFontFamily(fontFamily);
run.setFontSize(fontSize);
run.setColor(colorRGB);
run.setText(text);
run.setBold(bold);
if (addBreak) run.addBreak();
}
EDIT:
Later I noticed that adding the paragraph is actually enough. You don't need the setRun method, or to invoke it as setRun(paragraph.createRun() , "Arial" , 10, "2b5079" , celltext , false, false);.
I will check whether anything can be done about the encoding (for me, once the font was loaded, it worked fine even without adding the paragraph).
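Independent of POI, garbled console output is often a console-encoding problem and not a reading problem. The following is a small check under that assumption: it writes sample text through a UTF-8 `PrintStream` into a buffer and confirms the bytes round-trip. `Utf8Console` and `encodeUtf8` are hypothetical names. If the round trip succeeds but the console still shows wrong characters, the terminal's encoding or font is the more likely cause.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8Console {
    // Encodes text to UTF-8 bytes through a PrintStream, as the question does.
    public static byte[] encodeUtf8(String text) throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true, "UTF-8");
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Sample Sinhala-script text, written with escapes so that the
        // source file encoding cannot corrupt it
        String sample = "\u0DC3\u0DD2\u0D82\u0DC4\u0DBD";

        byte[] bytes = encodeUtf8(sample);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + sample.equals(decoded));

        // Create the console stream once, not once per table cell
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println(sample);
    }
}
```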

Replacing a text in Apache POI XWPF not working

I'm currently trying to work on the code mentioned on a previous post called Replacing a text in Apache POI XWPF.
I have tried the below and it works but I don't know if I am missing anything. When I run the code the text is not replaced but added onto the end of what was searched. For example I have created a basic word document and entered the text "test". In the below code when I run it I eventually get the new document with the text "testDOG".
I have had to change the original code from String text = r.getText(0) to String text = r.toString() because I kept getting a NullError while running the code.
import java.io.*;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class testPOI {
public static void main(String[] args) throws Exception{
String filepath = "F:\\MASTER_DOC.docx";
String outpath = "F:\\Test.docx";
XWPFDocument doc = new XWPFDocument(OPCPackage.open(filepath));
for (XWPFParagraph p : doc.getParagraphs()){
for (XWPFRun r : p.getRuns()){
String text = r.toString();
if(text.contains("test")) {
text = text.replace("test", "DOG");
r.setText(text);
}
}
}
doc.write(new FileOutputStream(outpath));
}
}
EDIT: Thanks for your help everyone. I browsed around and found a solution on Replace table column value in Apache POI

This method replaces search strings in paragraphs and can handle strings that span more than one run.
private long replaceInParagraphs(Map<String, String> replacements, List<XWPFParagraph> xwpfParagraphs) {
long count = 0;
for (XWPFParagraph paragraph : xwpfParagraphs) {
List<XWPFRun> runs = paragraph.getRuns();
for (Map.Entry<String, String> replPair : replacements.entrySet()) {
String find = replPair.getKey();
String repl = replPair.getValue();
TextSegement found = paragraph.searchText(find, new PositionInParagraph());
if ( found != null ) {
count++;
if ( found.getBeginRun() == found.getEndRun() ) {
// whole search string is in one Run
XWPFRun run = runs.get(found.getBeginRun());
String runText = run.getText(0); // index of the text element; getTextPosition() is a formatting property, not an index
String replaced = runText.replace(find, repl);
run.setText(replaced, 0);
} else {
// The search string spans over more than one Run
// Put the Strings together
StringBuilder b = new StringBuilder();
for (int runPos = found.getBeginRun(); runPos <= found.getEndRun(); runPos++) {
XWPFRun run = runs.get(runPos);
b.append(run.getText(0)); // index of the text element; getTextPosition() is a formatting property, not an index
}
String connectedRuns = b.toString();
String replaced = connectedRuns.replace(find, repl);
// The first Run receives the replaced String of all connected Runs
XWPFRun partOne = runs.get(found.getBeginRun());
partOne.setText(replaced, 0);
// Removing the text in the other Runs.
for (int runPos = found.getBeginRun()+1; runPos <= found.getEndRun(); runPos++) {
XWPFRun partNext = runs.get(runPos);
partNext.setText("", 0);
}
}
}
}
}
return count;
}

Your logic is not quite right. You need to collate all the text in the runs first and then do the replace. You also need to remove all runs for the paragraph and add a new single run if a match on "test" is found.
Try this instead:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class testPOI {
    public static void main(String[] args) throws Exception {
        String filepath = "F:\\MASTER_DOC.docx";
        String outpath = "F:\\Test.docx";
        XWPFDocument doc = new XWPFDocument(new FileInputStream(filepath));
        for (XWPFParagraph p : doc.getParagraphs()) {
            // Collate text of all runs
            StringBuilder sb = new StringBuilder();
            for (XWPFRun r : p.getRuns()) {
                // getText(0) reads the first text element of the run;
                // getTextPosition() is a formatting property, not a text index
                String runText = r.getText(0);
                if (runText != null) {
                    sb.append(runText);
                }
            }
            // Continue if there is text and it contains "test"
            if (sb.length() > 0 && sb.toString().contains("test")) {
                // Remove all existing runs, last to first, so the remaining
                // indexes stay valid while the list shrinks
                for (int i = p.getRuns().size() - 1; i >= 0; i--) {
                    p.removeRun(i);
                }
                String text = sb.toString().replace("test", "DOG");
                // createRun() already appends the new run to the paragraph
                XWPFRun run = p.createRun();
                run.setText(text);
            }
        }
        try (FileOutputStream out = new FileOutputStream(outpath)) {
            doc.write(out);
        }
    }
}
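To see why the text has to be collated across runs, here is a POI-free sketch that models a paragraph's runs as plain strings. That model is an assumption made purely for illustration, and `RunModel` and `replaceAcrossRuns` are hypothetical names. Word may split a word such as "test" over two runs, so a per-run search misses it while a search over the collated text finds it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RunModel {
    // Replaces 'find' in the concatenated run text. The first run receives
    // the whole result and the remaining runs are emptied.
    public static List<String> replaceAcrossRuns(List<String> runs, String find, String repl) {
        String joined = String.join("", runs);
        if (runs.isEmpty() || !joined.contains(find)) {
            return new ArrayList<>(runs); // nothing to replace
        }
        List<String> result = new ArrayList<>();
        result.add(joined.replace(find, repl));
        for (int i = 1; i < runs.size(); i++) {
            result.add("");
        }
        return result;
    }

    public static void main(String[] args) {
        // "test" is split over two runs
        List<String> runs = Arrays.asList("This is a te", "st document");

        boolean perRunHit = runs.stream().anyMatch(r -> r.contains("test"));
        System.out.println("found by per-run search: " + perRunHit);

        System.out.println(replaceAcrossRuns(runs, "test", "DOG"));
    }
}
```

The trade-off is the same as in the real document: once the text is merged into one run, any formatting that differed between the original runs is lost.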

It is worth noting that run.getTextPosition() returns -1 in most cases. This has no effect when there is only one text element per run, but technically a run can hold any number of text elements, and I have run into such cases. So the best way is to call getCTR() on the run and iterate over its text elements. Their number is equal to ctrRun.sizeOfTArray().
Sample code:
for (XWPFRun run : p.getRuns()) {
    CTR ctrRun = run.getCTR();
    int sizeOfCtr = ctrRun.sizeOfTArray();
    for (int textPosition = 0; textPosition < sizeOfCtr; textPosition++) {
        String text = run.getText(textPosition);
        if (text != null && text.contains("test")) {
            text = text.replace("test", "DOG");
            run.setText(text, textPosition);
        }
    }
}

Just change the text of every run in your paragraph, and then save the file.
This code worked for me:
XWPFDocument doc = new XWPFDocument(new FileInputStream(filepath));
for (XWPFParagraph p : doc.getParagraphs()) {
StringBuilder sb = new StringBuilder();
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.contains("variable1")) {
text = text.replace("variable1", "valeur1");
r.setText(text, 0);
}
if (text != null && text.contains("variable2")) {
text = text.replace("variable2", "valeur2");
r.setText(text, 0);
}
if (text != null && text.contains("variable3")) {
text = text.replace("variable3", "valeur3");
r.setText(text, 0);
}
}
}
doc.write(new FileOutputStream(outpath));

iText: split a PDF into several PDF (1 per page)

What I want is this: given a 10-page PDF file, I want to display each page of that PDF inside a table on the web. What is the best way to achieve this? I guess one way is to split the 10-page PDF file into ten 1-page PDFs and programmatically display each PDF in a row of a table. Can I do this with iText? Is there a better way to accomplish this?

From Split a PDF file (using iText)
import java.io.FileOutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.pdf.PdfCopy;
import com.lowagie.text.pdf.PdfImportedPage;
import com.lowagie.text.pdf.PdfReader;
public class SplitPDFFile {
/**
* @param args
*/
public static void main(String[] args) {
try {
String inFile = args[0].toLowerCase();
System.out.println ("Reading " + inFile);
PdfReader reader = new PdfReader(inFile);
int n = reader.getNumberOfPages();
System.out.println ("Number of pages : " + n);
int i = 0;
while ( i < n ) {
String outFile = inFile.substring(0, inFile.indexOf(".pdf"))
+ "-" + String.format("%03d", i + 1) + ".pdf";
System.out.println ("Writing " + outFile);
Document document = new Document(reader.getPageSizeWithRotation(1));
PdfCopy writer = new PdfCopy(document, new FileOutputStream(outFile));
document.open();
PdfImportedPage page = writer.getImportedPage(reader, ++i);
writer.addPage(page);
document.close();
writer.close();
}
}
catch (Exception e) {
e.printStackTrace();
}
/* example :
java SplitPDFFile d:\temp\x\tx.pdf
Reading d:\temp\x\tx.pdf
Number of pages : 3
Writing d:\temp\x\tx-001.pdf
Writing d:\temp\x\tx-002.pdf
Writing d:\temp\x\tx-003.pdf
*/
}
}
Many iText examples here.

With Apache PDFBox's PDDocument you can do this very easily.
You just need a Java List of PDDocument and the Splitter class to split a document.
List<PDDocument> pages = new ArrayList<PDDocument>();
try {
    // load() returns the document; keep the reference so it can be split
    PDDocument document = PDDocument.load(new File(filePath));
    Splitter splitter = new Splitter();
    pages = splitter.split(document); // one PDDocument per page
} catch (Exception e) {
    e.printStackTrace(); // print the reason and the line where the error occurred
}

I can't comment, but this line in the most voted answer
Document document = new Document(reader.getPageSizeWithRotation(1));
should be
Document document = new Document(reader.getPageSizeWithRotation(i+1));
to get the correct page size when other pages have a different page size (I know it's rare).

Number of pages in a word doc in java

Is there an easy way to count the number of pages in a Word document, either .doc or .docx?
Thanks

You could try the Apache POI API for Word documents:
http://poi.apache.org/
It has a method for getting the page count:
public int getPageCount()
Returns:
The page count or 0 if the SummaryInformation does not contain a page count.

I found a really cool class that counts pages for Word, Excel and PowerPoint with the help of Apache POI. It handles both the old .doc and the new .docx formats.
// Use the lower-cased path only for the extension checks; open the file
// via the original path, which matters on case-sensitive file systems.
String lowerFilePath = filePath.toLowerCase();
if (lowerFilePath.endsWith(".xls")) {
    HSSFWorkbook workbook = new HSSFWorkbook(new FileInputStream(filePath));
    Integer sheetNums = workbook.getNumberOfSheets();
    if (sheetNums > 0) {
        return workbook.getSheetAt(0).getRowBreaks().length + 1;
    }
} else if (lowerFilePath.endsWith(".xlsx")) {
    XSSFWorkbook xwb = new XSSFWorkbook(filePath);
    Integer sheetNums = xwb.getNumberOfSheets();
    if (sheetNums > 0) {
        return xwb.getSheetAt(0).getRowBreaks().length + 1;
    }
} else if (lowerFilePath.endsWith(".docx")) {
    XWPFDocument docx = new XWPFDocument(POIXMLDocument.openPackage(filePath));
    return docx.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
} else if (lowerFilePath.endsWith(".doc")) {
    HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(filePath));
    return wordDoc.getSummaryInformation().getPageCount();
} else if (lowerFilePath.endsWith(".ppt")) {
    HSLFSlideShow document = new HSLFSlideShow(new FileInputStream(filePath));
    SlideShow slideShow = new SlideShow(document);
    return slideShow.getSlides().length;
} else if (lowerFilePath.endsWith(".pptx")) {
    XSLFSlideShow xdocument = new XSLFSlideShow(filePath);
    XMLSlideShow xslideShow = new XMLSlideShow(xdocument);
    return xslideShow.getSlides().length;
}
source: OfficeTools.getPageCount()
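For .docx files, the page count returned through the extended properties is a stored value in the package part docProps/app.xml (the `Pages` element). Here is a minimal sketch that reads this value with only the JDK, assuming the file is a regular, unencrypted docx. `DocxStoredPageCount` and `storedPageCount` are hypothetical names. The number is whatever the last application that saved the file wrote, so it may be missing or stale.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DocxStoredPageCount {
    // Returns the stored <Pages> value from docProps/app.xml, or -1 if absent.
    public static int storedPageCount(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/app.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory before parsing
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    xml.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations in untrusted input
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                Document dom = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                NodeList pages = dom.getElementsByTagNameNS("*", "Pages");
                if (pages.getLength() == 0) {
                    return -1;
                }
                return Integer.parseInt(pages.item(0).getTextContent().trim());
            }
        }
        return -1; // no docProps/app.xml in the package
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java DocxStoredPageCount <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("stored page count: " + storedPageCount(in));
        }
    }
}
```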

Use Apache POI's SummaryInformation to fetch the Total page count of a MS word document

//Library is aspose
//package com.aspose.words.*
/*Open the Word Document */
Document doc = new Document("C:\\Temp\\file.doc");
/*Get page count */
int pageCount = doc.getPageCount();

docx4j can get total pages as below:
org.docx4j.openpackaging.parts.DocPropsExtendedPart docPropsExtendedPart = wordMLPkg.getDocPropsExtendedPart();
org.docx4j.docProps.extended.Properties extendedProps = (org.docx4j.docProps.extended.Properties)docPropsExtendedPart.getJaxbElement();
int numPages = extendedProps.getPages();
