In this question someone had a problem similar to mine: I want to read the content of a .pptx file (only the text), but I only got it to work with .ppt files. So I tried the accepted answer, but I got this exception: java.lang.ClassNotFoundException: org.apache.poi.hslf.model.TextPainter$Key
I used the example from this page (which was suggested in the accepted answer), so I have no idea why it does not work. My code:
public static String readPPTX(String path) throws FileNotFoundException, IOException{
XMLSlideShow ppt = new XMLSlideShow(new FileInputStream(path));
String content = "";
XSLFSlide[] slides = ppt.getSlides();
for (XSLFSlide slide : slides){
XSLFShape[] sh = slide.getShapes();
for (int j = 0; j < sh.length; j++){
if (sh[j] instanceof XSLFTextShape){
XSLFTextShape shape = (XSLFTextShape)sh[j];
content += shape.getText() + "\n";
}
}
}
return content;
}
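The solution is to add poi-scratchpad-3.9.jar to the project's classpath. The class named in the exception belongs to the org.apache.poi.hslf package, which ships in the scratchpad jar rather than in the core POI jar.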
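To confirm that the jar is really visible at runtime, you can ask the class loader for the class named in the exception. The following is only a diagnostic sketch; the names ClasspathCheck and isOnClasspath are made up for this example and are not part of POI.

```java
/** Diagnostic helper (hypothetical name): reports whether a class can be loaded. */
final class ClasspathCheck {

    /** Returns true if the named class is visible to this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class from the stack trace in the question.
        String missing = "org.apache.poi.hslf.model.TextPainter$Key";
        System.out.println(missing + " on classpath: " + isOnClasspath(missing));
    }
}
```

If this prints false after you added the jar, the jar is probably on the compile-time build path but not on the runtime classpath.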
I have a Spring action which should read in an Excel file and access the values contained in the cells. I am using the Java Excel API to process the Excel file. I want it to read from an external folder and not from the root of my project. The problem I am having is Spring MVC cannot recognize my file. I get this error almost every time:
The filename, directory name, or volume label syntax is incorrect
I even tried something like:
Resource banner = resourceLoader.getResource("http://howtodoinjava.com/readme.txt");
as seen here, and still no luck. My controller code:
@RequestMapping(value="migrateexcel", method = RequestMethod.POST)
public String migrateexcel(Model uiModel, HttpServletRequest httpServletRequest) throws BiffException, IOException {
String fileName = httpServletRequest.getParameter("filename");
if (fileName != null && !fileName.isEmpty()) { // null check first; this logic kicks in from a form
FileSystemResource resource = new FileSystemResource("file:C:/Users/Administrator/Desktop/myfolder/test.xlsx");
File xlsFile = resource.getFile();
String temp = "";
Workbook workbook = Workbook.getWorkbook(xlsFile);
Sheet sheet = workbook.getSheet(0);
for(int i =0; i<sheet.getRows(); i++){
for(int j =0; j < sheet.getColumns(); j++){
Cell cell = sheet.getCell(j,i);
temp = cell.getContents();
System.out.println(temp);
}
}
uiModel.addAttribute("message", "Excel Data migrated successfully");
return "rates/processexcel";
}else{
uiModel.addAttribute("message", "Please select a file before submitting");
return "rates/processexcel";
}
}
I tested it, and it seems that the only problem is the way you reference the file: the path must not carry a "file:" prefix.
Either of these two forms should work (unless there is some kind of permission problem):
"C:/Users/Administrator/Desktop/myfolder/test.xlsx"
or
"C:\\Users\\Administrator\\Desktop\\myfolder\\test.xlsx"
Spring and Spring MVC are actually irrelevant to the question; the constructors of FileSystemResource expect either a path String in that same syntax or a java.io.File.
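To see the difference between the two kinds of string without involving Spring at all, you can check the path with plain java.io.File before handing it to the Excel library. This is a minimal sketch; PathCheck and requireReadableFile are hypothetical names for this example.

```java
import java.io.File;
import java.io.FileNotFoundException;

/** Hypothetical helper: resolves a plain file-system path and fails early if it is unusable. */
final class PathCheck {

    /** Returns the File for a plain path, or throws if no readable regular file exists there. */
    static File requireReadableFile(String path) throws FileNotFoundException {
        File file = new File(path);
        if (!file.isFile() || !file.canRead()) {
            throw new FileNotFoundException("Not a readable file: " + file.getAbsolutePath());
        }
        return file;
    }

    public static void main(String[] args) {
        String[] candidates = {
            "C:/Users/Administrator/Desktop/myfolder/test.xlsx",      // plain path
            "file:C:/Users/Administrator/Desktop/myfolder/test.xlsx"  // URL-style prefix from the question
        };
        for (String candidate : candidates) {
            try {
                System.out.println("OK: " + requireReadableFile(candidate));
            } catch (FileNotFoundException e) {
                System.out.println(e.getMessage());
            }
        }
    }
}
```

The string with the "file:" prefix is treated as part of the file name, so it never resolves to the real file even when the file exists.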
I use XHTMLConverter to convert .docx to HTML in order to show a preview of the document. Is there any way to convert only a few pages of the original document? I'll be grateful for any help.
You have to parse the complete .docx file; it is not possible to read just parts of it. And if you want to select a specific page number, I'm afraid that (at least as far as I know) Word does not store page numbers, so there is no function in the library to access a specified page.
(I read this on another forum, so it might actually be false information.)
PS: the POI Excel API has a getSheetAt() method (this might help your research).
But there are other ways to get at your pages. For instance, you could read the lines of your document and search for the page numbers (this may break if your text itself contains those numbers). Another, more accurate way would be to search for the header of each page:
HeaderStories headerStore = new HeaderStories( doc);
String header = headerStore.getHeader(pageNumber);
This should give you the header of the specified page (HeaderStories is part of HWPF, so doc must be an HWPFDocument, i.e. a .doc file). Same with the footer:
HeaderStories headerStore = new HeaderStories( doc);
String footer = headerStore.getFooter(pageNumber);
If this doesn't work, I can't help much further; I am not very familiar with that API.
Here is a little example of a very sloppy solution:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile
{
    public static void main(String[] args)
    {
        File file = null;
        WordExtractor extractor = null;
        // Paragraph indices that delimit page one; -1 means "not found".
        int firstLineOfPageOne = -1;
        int lastLineOfPageOne = -1;
        try
        {
            file = new File("c:\\New.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            fis.close();
            for (int i = 0; i < fileData.length; i++)
            {
                // Paragraph text may end with a line break, so trim before comparing.
                String line = fileData[i].trim();
                if (line.equals("headerPageOne")) {
                    firstLineOfPageOne = i;
                }
                if (line.equals("headerPageTwo")) {
                    lastLineOfPageOne = i;
                }
            }
        }
        catch (Exception exep)
        {
            exep.printStackTrace();
        }
        System.out.println("Page one: paragraphs " + firstLineOfPageOne + " to " + lastLineOfPageOne);
    }
}
If you go with this, I would recommend creating a list of your headers and refactoring the for-loop into a separate getPages() method. Your loop would then look like this:
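List<String> headers = new ArrayList<String>(Arrays.asList("header1","header2","header3","header4"));
// x walks over the headers: page x + 1 starts at headers.get(x) and ends at headers.get(x + 1)
for (int x = 0; x < headers.size() - 1; x++)
{
    int firstLineOfPage = -1;
    int lastLineOfPage = -1;
    for (int i = 0; i < fileData.length; i++)
    {
        if (fileData[i].trim().equals(headers.get(x))) {
            firstLineOfPage = i;
        }
        if (fileData[i].trim().equals(headers.get(x + 1))) {
            lastLineOfPage = i;
        }
    }
    // store (firstLineOfPage, lastLineOfPage) for page x + 1 here
}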
You could create an object (int pageStart, int pageStop), which would be the return value of your method.
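A minimal sketch of such a getPages() method follows. It works on the plain String[] returned by getParagraphText(), so it can be tried without POI. It assumes that every page starts with a paragraph whose trimmed text equals one of the known headers and that the headers appear in order; PageLocator and PageRange are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: maps known page headers to paragraph index ranges. */
final class PageLocator {

    /** Half-open range [pageStart, pageStop) of paragraph indices belonging to one page. */
    static final class PageRange {
        final int pageStart;
        final int pageStop;

        PageRange(int pageStart, int pageStop) {
            this.pageStart = pageStart;
            this.pageStop = pageStop;
        }
    }

    /** Scans the paragraphs once and opens a new page each time the next expected header is found. */
    static List<PageRange> getPages(String[] paragraphs, List<String> headers) {
        List<PageRange> pages = new ArrayList<>();
        int start = -1;      // index of the most recent header, -1 until the first one is found
        int nextHeader = 0;  // position in headers of the header we are waiting for
        for (int i = 0; i < paragraphs.length && nextHeader < headers.size(); i++) {
            if (paragraphs[i].trim().equals(headers.get(nextHeader))) {
                if (start >= 0) {
                    pages.add(new PageRange(start, i)); // previous page ends where this one begins
                }
                start = i;
                nextHeader++;
            }
        }
        if (start >= 0) {
            pages.add(new PageRange(start, paragraphs.length)); // last page runs to the end
        }
        return pages;
    }
}
```

The caveat from above still applies: if a header text also occurs inside the body, the ranges will be wrong.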
I hope it helped you :)
I am using this example in my project. Everything works, and replacing text works too, but text that should be centered in my output file has become left-aligned. The input file is a .doc. I have a feeling that the replacement is breaking the document's formatting, but I'm not sure what the problem is. How can I resolve this?
public class HWPFTest {
public static void main(String[] args){
String filePath = "F:\\Sample.doc";
POIFSFileSystem fs = null;
try {
fs = new POIFSFileSystem(new FileInputStream(filePath));
HWPFDocument doc = new HWPFDocument(fs);
doc = replaceText(doc, "$VAR", "MyValue1");
saveWord(filePath, doc);
}
catch(FileNotFoundException e){
e.printStackTrace();
}
catch(IOException e){
e.printStackTrace();
}
}
private static HWPFDocument replaceText(HWPFDocument doc, String findText, String replaceText){
Range r1 = doc.getRange();
for (int i = 0; i < r1.numSections(); ++i ) {
Section s = r1.getSection(i);
for (int x = 0; x < s.numParagraphs(); x++) {
Paragraph p = s.getParagraph(x);
for (int z = 0; z < p.numCharacterRuns(); z++) {
CharacterRun run = p.getCharacterRun(z);
String text = run.text();
if(text.contains(findText)) {
run.replaceText(findText, replaceText);
}
}
}
}
return doc;
}
private static void saveWord(String filePath, HWPFDocument doc) throws FileNotFoundException, IOException{
FileOutputStream out = null;
try{
out = new FileOutputStream(filePath);
doc.write(out);
}
finally{
if (out != null) out.close(); // out stays null if the FileOutputStream constructor threw
}
}
}
HWPF is not usable for writing .doc files. It may work for very simple file content, but little extras break it. I fear you are out of luck here - if it is an option for you, you might want to use RTF files and work on those. Word should work correctly, if you rename the rtf extension to .doc (if you need the .doc extension).
(I developed a custom and working variant of HWPF for a client and know how difficult things can get there. The standard HWPF library will get in trouble when characters outside the 8-bit encoding exist, when tables are used, when text boxes are used, when graphics are embedded, ... . Some things in .doc files are also different from what is described in the official specification from Microsoft. Making a working HWPF library is "non-trivial" and requires a lot of spec-reading and investigation. If you want to go down the road to fix those bugs, you need to expect at least half a year of development effort.)
I downloaded Apache HWPF. I want to use it to read a .doc file and write its text into a plain text file. I don't know HWPF very well.
I have 3 problems now:
1. Some of the packages have errors (they can't find apache hdf). How can I fix them?
2. How can I use the methods of HWPF to find and extract the images?
3. Some pieces of my program are incomplete and incorrect, so please help me complete them.
I have to complete this program in 2 days.
Once again: please, please help me complete this.
Thank you all a lot for your help!
This is my elementary code:
public class test {
public void m1 (){
String filesname = "Hello.doc";
POIFSFileSystem fs = null;
fs = new POIFSFileSystem(new FileInputStream(filesname));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
String str = we.getText() ;
String[] paragraphs = we.getParagraphText();
Picture pic = new Picture(. . .) ;
pic.writeImageContent( . . . ) ;
PicturesTable picTable = new PicturesTable( . . . ) ;
if ( picTable.hasPicture( . . . ) ){
picTable.extractPicture(..., ...);
picTable.getAllPictures() ;
}
}
Apache Tika will do this for you. It handles talking to POI to do the HWPF stuff, and presents you with either XHTML or Plain Text for the contents of the file. If you register a recursing parser, then you'll also get all the embedded images too.
//you can use the org.apache.poi.hwpf.extractor.WordExtractor to get the text
String fileName = "example.doc";
HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
WordExtractor extractor = new WordExtractor(wordDoc);
String[] text = extractor.getParagraphText();
int lineCounter = text.length;
StringBuilder articleStr = new StringBuilder(); // Collects the text from the Word document.
for(int index = 0;index < lineCounter;++ index){
    String paragraphStr = text[index].replaceAll("\r\n","").replaceAll("\n","").trim();
    int paragraphLength = paragraphStr.length();
    if(paragraphLength != 0){
        // String.concat returns a new String and leaves the receiver unchanged, so append to a builder instead.
        articleStr.append(paragraphStr);
    }
}
//you can use the org.apache.poi.hwpf.usermodel.Picture to get the image
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
for(int i = 0;i < picturesList.size();++i){
BufferedImage image = null;
Picture pic = picturesList.get(i);
image = ImageIO.read(new ByteArrayInputStream(pic.getContent()));
if(image != null){
System.out.println("Image["+i+"]"+" ImageWidth:"+image.getWidth()+" ImageHeight:"+image.getHeight()+" Suggest Image Format:"+pic.suggestFileExtension());
}
}
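The paragraph clean-up step above can be isolated from the POI calls so that it is easy to try on its own. This is only a sketch: ParagraphJoiner and joinParagraphs are hypothetical names, and unlike the loop above it puts a newline between paragraphs so that they do not run together.

```java
/** Hypothetical helper isolating the paragraph clean-up step from the POI calls. */
final class ParagraphJoiner {

    /** Strips line breaks and surrounding whitespace, skips empty paragraphs, joins the rest with '\n'. */
    static String joinParagraphs(String[] paragraphs) {
        StringBuilder article = new StringBuilder();
        for (String paragraph : paragraphs) {
            String cleaned = paragraph.replace("\r", "").replace("\n", "").trim();
            if (cleaned.isEmpty()) {
                continue; // nothing left after clean-up
            }
            if (article.length() > 0) {
                article.append('\n'); // separator between paragraphs (an assumption of this sketch)
            }
            article.append(cleaned);
        }
        return article.toString();
    }
}
```

You would pass it the array from extractor.getParagraphText() and write the returned String to your text file.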
If you just want to do this, and you don't care about the coding, you can just use Antiword.
$ antiword file.doc > out.txt
I know this is long after the fact, but I found TextMining on Google Code to be more accurate and very easy to use. It is, however, pretty much abandoned code.
I'm trying to open an MS Word 2003 document in Java, search for a specified String and replace it with a new String. I use Apache POI to do that. My code looks like this:
public void searchAndReplace(String inputFilename, String outputFilename,
HashMap<String, String> replacements) {
File outputFile = null;
File inputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
Set<String> keySet = null;
Iterator<String> keySetIterator = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
String key = null;
String value = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
docRange = document.getRange();
numParagraphs = docRange.numParagraphs();
keySet = replacements.keySet();
for (int i = 0; i < numParagraphs; i++) {
paragraph = docRange.getParagraph(i);
text = paragraph.text();
numCharRuns = paragraph.numCharacterRuns();
for (int j = 0; j < numCharRuns; j++) {
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
System.out.println("Character Run text: " + text);
keySetIterator = keySet.iterator();
while (keySetIterator.hasNext()) {
key = keySetIterator.next();
if (text.contains(key)) {
value = replacements.get(key);
charRun.replaceText(key, value);
docRange = document.getRange();
paragraph = docRange.getParagraph(i);
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
}
}
}
}
bufIStream.close();
bufIStream = null;
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
} catch (Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
}
I call this function with following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line like "AAA EEE", it works. With a more complicated file, the content is read successfully and Test1.doc is generated, but when I try to open it, Word gives me the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do, because I'm a beginner in POI and I have not found a good tutorial for it.
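First of all, you should be closing your streams: the code never closes (or flushes) bufOStream after document.write, so bytes still sitting in the buffer may never reach the file.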
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .XML to .doc . Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
Unfortunately, there is not much documentation about POI at all, especially for Word documents.
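On the point about closing: a try-with-resources block (Java 7 or later) is one way to make sure the output stream is flushed and closed even when writing fails. The sketch below uses only java.io; SafeSave and DocumentWriter are hypothetical names, and the callback stands in for a call such as document.write(out).

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper: writes to a file and always flushes and closes the stream. */
final class SafeSave {

    /** Callback standing in for something like document.write(out). */
    interface DocumentWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    static void save(String outputFilename, DocumentWriter writer) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFilename))) {
            writer.writeTo(out);
        } // out is flushed and closed here, even if writeTo throws
    }
}
```

With Java 8 lambdas the call from the question's code could look like SafeSave.save(outputFilename, out -> document.write(out)). This only rules out a truncated file; it does not address HWPF's own limits when writing complex documents.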
I don't know whether it is OK to answer my own question, but just to share the knowledge, I will.
After searching the web, the final solution I found is this:
The library called docx4j is very good for dealing with MS .docx files. Its documentation is still thin and its forum is only just getting started, but overall it helped me do what I needed.
Thanks to all who helped me.
You could try the OpenOffice API, but there aren't many resources out there that tell you how to use it.
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
Looks like this could be the issue.