Is there an easy way to count the number of pages is a Word document either .doc or .docx?
Thanks
You could try the Apache API for word Docs:
http://poi.apache.org/
It as a method for getting the page count:
public int getPageCount()
Returns:
The page count or 0 if the SummaryInformation does not contain a page count.
I found a really cool class, that count Pages for Word, Excel and PowerPoint. With help of Apache POI. And it is for old doc and new docx.
String lowerFilePath = filePath.toLowerCase();
if (lowerFilePath.endsWith(".xls")) {
HSSFWorkbook workbook = new HSSFWorkbook(new FileInputStream(lowerFilePath));
Integer sheetNums = workbook.getNumberOfSheets();
if (sheetNums > 0) {
return workbook.getSheetAt(0).getRowBreaks().length + 1;
}
} else if (lowerFilePath.endsWith(".xlsx")) {
XSSFWorkbook xwb = new XSSFWorkbook(lowerFilePath);
Integer sheetNums = xwb.getNumberOfSheets();
if (sheetNums > 0) {
return xwb.getSheetAt(0).getRowBreaks().length + 1;
}
} else if (lowerFilePath.endsWith(".docx")) {
XWPFDocument docx = new XWPFDocument(POIXMLDocument.openPackage(lowerFilePath));
return docx.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
} else if (lowerFilePath.endsWith(".doc")) {
HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(lowerFilePath));
return wordDoc.getSummaryInformation().getPageCount();
} else if (lowerFilePath.endsWith(".ppt")) {
HSLFSlideShow document = new HSLFSlideShow(new FileInputStream(lowerFilePath));
SlideShow slideShow = new SlideShow(document);
return slideShow.getSlides().length;
} else if (lowerFilePath.endsWith(".pptx")) {
XSLFSlideShow xdocument = new XSLFSlideShow(lowerFilePath);
XMLSlideShow xslideShow = new XMLSlideShow(xdocument);
return xslideShow.getSlides().length;
}
source: OfficeTools.getPageCount()
Use Apache POI's SummaryInformation to fetch the Total page count of a MS word document
//Library is aspose
//package com.aspose.words.*
/*Open the Word Document */
Document doc = new Document("C:\\Temp\\file.doc");
/*Get page count */
int pageCount = doc.getPageCount();
docx4j can get total pages as below:
org.docx4j.openpackaging.parts.DocPropsExtendedPart docPropsExtendedPart = wordMLPkg.getDocPropsExtendedPart();
org.docx4j.docProps.extended.Properties extendedProps = (org.docx4j.docProps.extended.Properties)docPropsExtendedPart.getJaxbElement();
int numPages = extendedProps.getPages();
Related
I use XHTMLConverter to convert .docx to html, to make preview of the document. Is there any way to convert only few pages from original document? I'll be grateful for any help.
You have to parse the complete .docx file. It is not possible to read just parts of it. Otherwise if you want to know how to select a specific page number, im afraid to tell you(at least I believe) that word does not store page numbers therefore there is no function in the libary to accsess a specified page..
(I've read this at another forum, it actually might be false information).
PS: the Excel POI contains a .getSheetAt()method (this might helps you for your research)
But there are also other ways to accsess your pages. For instance you could read the lines of your docx document and search for the pagenumbers(might crash if your text contains those numbers though). Another way would be to search for the header of the site which would be more accurate:
HeaderStories headerStore = new HeaderStories( doc);
String header = headerStore.getHeader(pageNumber);
this should give you the header of the specified page. Same with footer:
HeaderStories headerStore = new HeaderStories( doc);
String footer = headerStore.getFooter(pageNumber);
If this dosen't work. I am not really into that API....
here a little Example for a very sloppy solution:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadDocFile
{
public static void main(String[] args)
{
File file = null;
WordExtractor extractor = null;
try
{
file = new File("c:\\New.doc");
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument document = new HWPFDocument(fis);
extractor = new WordExtractor(document);
String[] fileData = extractor.getParagraphText();
for (int i = 0; i < fileData.length; i++)
{
if (fileData[i].equals("headerPageOne")){
int firstLineOfPageOne = i;
}
if (fileData[i]).equals("headerPageTwo"){
int lastLineOfPageOne = i
}
}
}
catch (Exception exep)
{
exep.printStackTrace();
}
}
}
If you go with this i would recommend you to create a String[] with your headers and refractor the for-loop to a seperate getPages() Method. Therefore your loop would look like:
List<String> = new ArrayList<String>(Arrays.asList("header1","header2","header3","header4"));
for (int i = 0; i < fileData.length; i++)
{
//well there should be a loop for "x" too
if (fileData[i].equals(headerArray[x])){
int firstLineOfPageOne = i;
}
if (fileData[i]).equals(headerArray[x+1]){
int lastLineOfPageOne = i
}
}
You could create an Object(int pageStart, int PageStop), wich would be the product of your method.
I hope it helped you :)
I am evaluating pdfbox-1.8.6 and iText-4.2.0(free) for performance. I have noticed that content extraction is faster in iText, but searching words using regex in the content extracted by iText takes longer time than pdfbox.
Extracted content and its size seems same in both cases.
Can anybody explain me on this?
My environment is ubuntu 12.04/java 1.7.
EDIT: adding sample codes.
pdfbox
//in constructor
fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
pdfDocument = PDDocument.load(bis);
textStripper = new PDFTextStripper();
//in extarctContent method
textStripper.setStartPage(pageNo);
textStripper.setEndPage(pageNo);
return textStripper.getText(pdfDocument);
iText
//in constructor
fis = new FileInputStream(file);
reader = new PdfReader(fis);
pdfTextExtractor = new PdfTextExtractor(reader);
//in extarctContent method
return pdfTextExtractor.getTextFromPage(pageNo);
searching words
StopWatch searchTime = ...
StopWatch pdfTime = ...
for (int i = 1; i <= pageCount; i++) {
// fetch one page
pdfTime.resume();
String pageContent = pdfParserService.extractPageContent(i);
pdfTime.suspend();
if (pageContent == null) {
pageContent = "";
}
pageContent = pageContent.replace("-\n", "").replace("\n", "").replace("\\", " ");
searchTime.resume();
Collection list = searchService.search(pageContent, wordList);
searchTime.suspend();
}
pdfTime.stop();
searchTime.stop();
System.out.println(pdfTime.getTime());
System.out.println(searchTime.getTime());
I am able to read tables from doc file. (see following code)
public String readDocFile(String filename, String str) {
try {
InputStream fis = new FileInputStream(filename);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
boolean intable = false;
boolean inrow = false;
for (int i = 0; i < range.numParagraphs(); i++) {
Paragraph par = range.getParagraph(i);
//System.out.println("paragraph "+(i+1));
//System.out.println("is in table: "+par.isInTable());
//System.out.println("is table row end: "+par.isTableRowEnd());
//System.out.println(par.text());
if (par.isInTable()) {
if (!intable) {//System.out.println("New table creating"+intable);
str += "<table border='1'>";
intable = true;
}
if (!inrow) {//System.out.println("New row creating"+inrow);
str += "<tr>";
inrow = true;
}
if (par.isTableRowEnd()) {
inrow = false;
} else {
//System.out.println("New text adding"+par.text());
str += "<td>" + par.text() + "</td>";
}
} else {
if (inrow) {//System.out.println("Closing Row");
str += "</tr>";
inrow = false;
}
if (intable) {//System.out.println("Closing Table");
str += "</table>";
intable = false;
}
str += par.text() + "<br/>";
}
}
} catch (Exception e) {
System.out.println("Exception: " + e);
}
return str;
}
Can anyone suggest me how can I do the same with docx file ?
I tried to do that. But could not locate a replacement of 'Range' class.
Please help.
By popular request, promoting a comment to an answer...
In the Apache POI code examples, you can find the XWPF SimpleTable example
This shows how to create a simple table, and how to create one with lots of fancy styling.
Assuming you just want a simple table from scratch, in a brand new workbook, then the code you need goes along the lines of:
// Start with a new document
XWPFDocument doc = new XWPFDocument();
// Add a 3 column, 3 row table
XWPFTable table = doc.createTable(3, 3);
// Set some text in the middle
table.getRow(1).getCell(1).setText("EXAMPLE OF TABLE");
// table cells have a list of paragraphs; there is an initial
// paragraph created when the cell is created. If you create a
// paragraph in the document to put in the cell, it will also
// appear in the document following the table, which is probably
// not the desired result.
XWPFParagraph p1 = table.getRow(0).getCell(0).getParagraphs().get(0);
XWPFRun r1 = p1.createRun();
r1.setBold(true);
r1.setText("The quick brown fox");
r1.setItalic(true);
r1.setFontFamily("Courier");
r1.setUnderline(UnderlinePatterns.DOT_DOT_DASH);
r1.setTextPosition(100);
// And at the end
table.getRow(2).getCell(2).setText("only text");
// Save it out, to view in word
FileOutputStream out = new FileOutputStream("simpleTable.docx");
doc.write(out);
out.close();
The following snippet uses Apache POI 5.0.0, and it works well when reading docx table data
public void readDocxTables(String docxFilePath) throws FileNotFoundException, IOException {
XWPFDocument doc = new XWPFDocument(new FileInputStream(docxFilePath));
for(XWPFTable table : doc.getTables()) {
for(XWPFTableRow row : table.getRows()) {
for(XWPFTableCell cell : row.getTableCells()) {
System.out.println("cell text: " + cell.getText());
}
}
}
}
This is not Apache POI, but using third party component found it much easier.
An example how to get tables from a docx file.
Of course, just idea if you do not find solution with the POI,
What I want is that: given a 10-pages-pdf-file, I want to display each page of that pdf inside a table on the web. What is the best way to achieve this? I guess one way is to split this 10-pages-pdf-file into 10 1-pages pdf, and programmatically display each pdf onto a row of a table. Can I do this with iText? Is there a better way to accomplish this?
From Split a PDF file (using iText)
import java.io.FileOutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.pdf.PdfCopy;
import com.lowagie.text.pdf.PdfImportedPage;
import com.lowagie.text.pdf.PdfReader;
public class SplitPDFFile {
/**
* #param args
*/
public static void main(String[] args) {
try {
String inFile = args[0].toLowerCase();
System.out.println ("Reading " + inFile);
PdfReader reader = new PdfReader(inFile);
int n = reader.getNumberOfPages();
System.out.println ("Number of pages : " + n);
int i = 0;
while ( i < n ) {
String outFile = inFile.substring(0, inFile.indexOf(".pdf"))
+ "-" + String.format("%03d", i + 1) + ".pdf";
System.out.println ("Writing " + outFile);
Document document = new Document(reader.getPageSizeWithRotation(1));
PdfCopy writer = new PdfCopy(document, new FileOutputStream(outFile));
document.open();
PdfImportedPage page = writer.getImportedPage(reader, ++i);
writer.addPage(page);
document.close();
writer.close();
}
}
catch (Exception e) {
e.printStackTrace();
}
/* example :
java SplitPDFFile d:\temp\x\tx.pdf
Reading d:\temp\x\tx.pdf
Number of pages : 3
Writing d:\temp\x\tx-001.pdf
Writing d:\temp\x\tx-002.pdf
Writing d:\temp\x\tx-003.pdf
*/
}
}
Many iText examples here.
With PDDocument you can do so very easily.
You just have to use a Java List of PDDocument type and Splitter function to split a document.
List<PDDocument> Pages=new ArrayList<PDDocument>();
PDDocument.load(filePath);
try {
Splitter splitter = new Splitter();
Pages = splitter.split(document);
}
catch(Exception e) {
e.printStackTrace(); // print reason and line number where error exist
}
I can't comment, but this line in the most voted answer
Document document = new Document(reader.getPageSizeWithRotation(1));
should be
Document document = new Document(reader.getPageSizeWithRotation(i+1));
to get the correct pdf size if other pages have different page size (it know it's rare)
I'm trying to open MS Word 2003 document in java, search for a specified String and replace it with a new String. I use APACHE POI to do that. My code is like the following one:
public void searchAndReplace(String inputFilename, String outputFilename,
HashMap<String, String> replacements) {
File outputFile = null;
File inputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
Set<String> keySet = null;
Iterator<String> keySetIterator = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
String key = null;
String value = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
docRange = document.getRange();
numParagraphs = docRange.numParagraphs();
keySet = replacements.keySet();
for (int i = 0; i < numParagraphs; i++) {
paragraph = docRange.getParagraph(i);
text = paragraph.text();
numCharRuns = paragraph.numCharacterRuns();
for (int j = 0; j < numCharRuns; j++) {
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
System.out.println("Character Run text: " + text);
keySetIterator = keySet.iterator();
while (keySetIterator.hasNext()) {
key = keySetIterator.next();
if (text.contains(key)) {
value = replacements.get(key);
charRun.replaceText(key, value);
docRange = document.getRange();
paragraph = docRange.getParagraph(i);
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
}
}
}
}
bufIStream.close();
bufIStream = null;
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
} catch (Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
}
I call this function with following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line like this : "AAA EEE", it works successfully, but when i use a complicated file it will read the content successfully and generate the Test1.doc file but when I try to open it, it will give me the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do, because I'm a beginner in POI and I have not found a good tutorial for it.
First of all you should be closing your document.
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .XML to .doc . Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
There is not much documentation about POI at all, especially for Word document unfortunately.
I don't know : is its OK to answer myself, but Just to share the knowledge, I'll answer myself.
After navigating the web, the final solution i found is :
The Library called docx4j is very good for dealing with MS docx file, although its documentation is not enough till now and its forum is still in a beginning steps, but overall it help me to do what i need..
Thanks 4 all who help me..
You could try OpenOffice API, but there arent many resources out there to tell you how to use it.
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
Looks like this could be the issue.