Java POI fails to write a large Word file

I am using Apache POI to delete "enter" characters (blank lines) from a .doc file.
My code below works correctly when the input file is small (for example, under 1 MB). However, with a larger input.doc of about 4 MB, output.doc is not generated correctly and I cannot open it.
Does anyone have a better idea for writing the large file correctly? Or is there other Java code that can delete the "enter" characters in a large .doc file? Thank you very much.
package mydoc;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.usermodel.*;
import java.io.*;
public class test {
    /* The ASCII code of "Enter" (carriage return) is 13 */
    private static final short ENTER_ASCII = 13;
    public static void main(String[] args) {
        /* the location of the input file */
        String fileName = "D:\\input.doc";
        deleteEnter(fileName);
    }
    public static void deleteEnter(String fileName) {
        // try-with-resources closes the input stream even if reading fails
        try (FileInputStream fis = new FileInputStream(fileName)) {
            POIFSFileSystem fs = new POIFSFileSystem(fis);
            HWPFDocument doc = new HWPFDocument(fs);
            Range range = doc.getRange();
            // Walk backwards so that deleting a paragraph does not shift
            // the indices of paragraphs that have not been checked yet.
            for (int i = range.numParagraphs() - 1; i >= 0; i--) {
                String text = range.getParagraph(i).text();
                // Guard against empty text before looking at the first character
                if (!text.isEmpty() && text.charAt(0) == ENTER_ASCII) {
                    range.getParagraph(i).delete();
                }
            }
            try (FileOutputStream fos = new FileOutputStream(new File("D:\\output.doc"))) {
                doc.write(fos);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
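
A note on the loop, independent of file size: removing elements from an indexed sequence while visiting it is order-sensitive, because each removal shifts the later elements down by one. The sketch below uses a plain list of strings as a stand-in for the document's paragraphs. This is an assumption for illustration only: it does not call POI, and the class and method names are made up. It shows that a forward loop skips the element right after each removal, while a backward loop does not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a paragraph list; not part of Apache POI.
final class DeleteWhileIterating {
    /** Forward loop: after a removal the next element slides into slot i and is skipped by i++. */
    static List<String> removeBlankForward(List<String> paragraphs) {
        List<String> copy = new ArrayList<>(paragraphs);
        for (int i = 0; i < copy.size(); i++) {
            if (copy.get(i).isEmpty()) {
                copy.remove(i);
            }
        }
        return copy;
    }

    /** Backward loop: a removal only shifts elements that were already visited. */
    static List<String> removeBlankBackward(List<String> paragraphs) {
        List<String> copy = new ArrayList<>(paragraphs);
        for (int i = copy.size() - 1; i >= 0; i--) {
            if (copy.get(i).isEmpty()) {
                copy.remove(i);
            }
        }
        return copy;
    }
}
```

With two consecutive blank entries, the forward version leaves one of them behind. Whether this has anything to do with the unreadable output for the 4 MB file is not established here.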

Depending on your needs, you could use a Word macro instead.
You should also be able to use a pattern like this: "^13{2,}", but that didn't work for me in Word 2010; see http://social.msdn.microsoft.com/Forums/en-US/0d921f97-b59a-48a9-a01a-20fe72f21c19/how-to-remove-blank-lines-?forum=worddev
Sub RemoveBlankLines()
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Text = "^p^p"
        .Replacement.Text = "^p"
        .MatchWildcards = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub
Sub RemoveEnters()
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        '^11 or ^l New line
        .Text = "^l"
        .Replacement.Text = ""
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    With Selection.Find
        '^13 or ^p Carriage return/paragraph mark
        .Text = "^p"
        .Replacement.Text = ""
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

"Enter" is the line separator, right? It is platform-dependent, so I propose the following solution:
String separator = System.getProperty("line.separator");
File file = new File(filename);
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument document = new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String[] fileData = extractor.getParagraphText();
for (int i = 0; i < fileData.length; i++) {
    if (fileData[i] != null)
        fileData[i] = fileData[i].replace(separator, "");
}
Then you just have to write fileData out to a clean .doc file.
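
If the goal is to drop blank paragraphs rather than strip separators inside each one, the filtering step can be sketched without POI. The helper below is hypothetical (the class and method names are not from any library) and assumes the paragraph strings may end in "\r", "\n" or "\r\n" regardless of platform, so it treats any paragraph that is only whitespace as blank instead of comparing against line.separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; operates on plain strings such as an extractor's paragraph array.
final class BlankParagraphFilter {
    /** Returns the paragraphs that contain something other than whitespace and line breaks. */
    static List<String> dropBlank(String[] paragraphs) {
        List<String> kept = new ArrayList<>();
        for (String p : paragraphs) {
            // trim() removes spaces, tabs, '\r' and '\n' from both ends
            if (p != null && !p.trim().isEmpty()) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

Usage would be along the lines of BlankParagraphFilter.dropBlank(fileData), with fileData being the array from the snippet above.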

Related

JAVA Apache POI wrong document write output

I'm reading a file, adding some data to its paragraphs, and then writing the document out to another file.
The problem I'm facing is that the output file is unreadable: I can't open it, and if I open it in a binary editor I can see that it doesn't have the correct format.
Every character has a ? character to its left.
Can you give me some advice about what is happening?
Wrong output
Correct output
EDIT: Code of the save function:
FileOutputStream out = null;
try {
    // Add true to make the data append possible in output stream.
    out = new FileOutputStream(filePath, true);
    doc.write(out);
    out.flush();
} catch (Exception ex) {
    ex.printStackTrace();
} finally {
    // out is still null if the stream could not be opened
    if (out != null) {
        out.close();
    }
}
Code that edits the file:
File file = new File("muestra.doc");
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument document = new HWPFDocument(fs);
Range range = document.getRange();
for (int i = 0; i < document.getParagraphTable().getParagraphs().size(); i++) {
    Paragraph p = range.getParagraph(i);
    p.insertBefore("£");
}
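
One thing worth checking in the save function above is the second argument of the FileOutputStream constructor. With true, new bytes are added after whatever the file already contains, so if the target file exists from an earlier run the result is the old content followed by the new document. The sketch below shows only that stream behaviour with plain byte arrays; the class and method names are made up, and it is not a confirmed diagnosis of the unreadable output.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; uses only the JDK.
final class AppendVersusOverwrite {
    /** Writes data to path in append or overwrite mode and returns the resulting file size. */
    static long writeAndMeasure(Path path, byte[] data, boolean append) throws IOException {
        // try-with-resources flushes and closes the stream on every path
        try (FileOutputStream out = new FileOutputStream(path.toFile(), append)) {
            out.write(data);
        }
        return Files.size(path);
    }
}
```

Writing the same 100 bytes twice gives a 200-byte file in append mode and a 100-byte file in overwrite mode.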

How to replace DataXML from Slide Diagram in Powerpoint using Apache POI

I want to replace one data.xml file inside a PowerPoint presentation with the data.xml from another presentation, in Java, using Apache POI.
For reference, I want to replace the following file with the one from another PowerPoint file.
Below is the code I have tried, but the XML isn't being replaced: the two files still have different XML every time I run this code.
public static void main(String[] args) {
    // TODO Auto-generated method stub
    final String filename = "C:/Users/skhan/Desktop/game.pptx";
    final String filename1 = "C:/Users/skhan/Desktop/globe.pptx";
    try {
        XMLSlideShow ppt = new XMLSlideShow(new FileInputStream(filename));
        OPCPackage pkg = ppt.getPackage();
        PackagePart data = pkg.getPart(
            PackagingURIHelper.createPartName("/ppt/diagrams/data1.xml"));
        InputStream data1Inp = data.getInputStream();
        XMLSlideShow ppt1 = new XMLSlideShow(new FileInputStream(filename1));
        OPCPackage pkg1 = ppt1.getPackage();
        PackagePart data11 = pkg1.getPart(
            PackagingURIHelper.createPartName("/ppt/diagrams/data1.xml"));
        InputStream data1Inp1 = data11.getInputStream();
        String data1String = GetData(data1Inp);
        String data2String = GetData(data1Inp1);
        // I want to replace here
        PrintStream pr = new PrintStream(data.getOutputStream());
        pr.print(data2String);
        pr.close();
        System.out.println("Completed");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
public static String GetData(InputStream input) throws Exception
{
    StringBuilder builder = new StringBuilder();
    int ch;
    while ((ch = input.read()) != -1) {
        builder.append((char) ch);
    }
    String theString = builder.toString();
    return theString;
}
I added a few lines after the change in order to save the file.
The XMLSlideShow must be written to a file after anything is changed or added.
File file =new File(filename);
FileOutputStream out = new FileOutputStream(file);
ppt.write(out);
out.close();
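
A side note on the GetData helper above: it casts each byte to a char, which only round-trips correctly for single-byte characters. The sketch below is a hypothetical replacement that decodes the whole stream at once, assuming the XML part is UTF-8 encoded (an assumption; check the part's XML declaration). The class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; uses only the JDK.
final class StreamText {
    /** Reads the stream to the end and decodes the bytes as UTF-8. */
    static String readUtf8(InputStream input) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = input.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

When writing the text back, encoding it explicitly with getBytes(StandardCharsets.UTF_8) keeps both directions consistent instead of relying on the platform default charset.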

In need of a clear example on how to get the word count of DOC and DOCX files

I am able to read a DOC file and get its word count, BUT it is wrong.
My code:
public class WordCounter {
    public static void main(String[] args) throws Throwable {
        processDOC();
    }
    private static void processDOC() throws Throwable {
        File file = new File("/Users/yjiang/Desktop/whatever.doc");
        File file2 = new File("/Users/yjiang/Desktop/Test.docx");
        File file3 = new File("/Users/yjiang/Desktop/QB Tests 4-14-2014.xls");
        File file4 = new File("/Users/yjiang/Desktop/QB Tests 4-14-2014.xlsx");
        try {
            FileInputStream fs = new FileInputStream(file);
            POIFSFileSystem poifsFileSystem = new POIFSFileSystem(fs);
            DirectoryEntry directoryEntry = poifsFileSystem.getRoot();
            DocumentEntry documentEntry = (DocumentEntry) directoryEntry.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
            DocumentInputStream dis = new DocumentInputStream(documentEntry);
            PropertySet ps = new PropertySet(dis);
            SummaryInformation si = new SummaryInformation(ps);
            System.out.println(si.getWordCount());
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            HWPFDocument hwpfDocument = new HWPFDocument(new FileInputStream(file));
            System.out.println(hwpfDocument.getDocProperties().getCWords()); // actually 71 words using word count in MSWord, returned 57.
            System.out.println(hwpfDocument.getDocProperties().getCWordsFtnEnd());
            XWPFDocument xwpfDocument = new XWPFDocument(new FileInputStream(file2)); // actually 71 words using word count in MSWord, returned 57.
            System.out.println(xwpfDocument.getProperties().getExtendedProperties().getUnderlyingProperties().getWords());
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
"whatever.doc" has 71 words, when I run this, it returns only 57.
It seems I cannot use the same method to read DOCX files; when I run it I get the following:
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
Could someone provide an example?
I've also found that the built-in word counters give strange counts, but text extraction seems to be more reliable, so I use this solution:
public long getWordCount(File file) throws IOException {
    POITextExtractor textExtractor;
    if (file.getName().endsWith(".docx")) {
        XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
        textExtractor = new XWPFWordExtractor(doc);
    }
    else if (file.getName().endsWith(".doc")) {
        textExtractor = new WordExtractor(new FileInputStream(file));
    }
    else {
        throw new IllegalArgumentException("Not a MS Word file.");
    }
    return Arrays.stream(textExtractor.getText().split("\\s+"))
        .filter(s -> s.matches("^.*[\\p{L}\\p{N}].*$"))
        .count();
}
The regex at the bottom can be adjusted if needed, but overall this one has proved fairly resilient.
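
To make the counting step easier to try out and adjust, here it is on its own, applied to a plain string instead of extracted document text. The class and method names are made up for illustration; the split pattern and the filter regex are the ones from the method above.

```java
import java.util.Arrays;

// Hypothetical helper; isolates the counting logic so it can be tested without POI.
final class TextWordCount {
    /** Counts whitespace-separated tokens that contain at least one letter or digit. */
    static long countWords(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(s -> s.matches("^.*[\\p{L}\\p{N}].*$"))
                .count();
    }
}
```

For example, "Hello, world - 42 times!" counts as 4 words: the lone "-" is dropped because it contains no letter or digit, while "Hello," and "times!" are kept with their punctuation attached.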

Creating a file dynamically through jsp

I have a block of JSP code like this. Here blockerdata, criticaldata, majordata and minordata are StringBuilder objects whose values are appended dynamically in a loop. Now I'm trying to write them to XML files like this.
<%
    System.out.println(blockerdata);
    System.out.println(criticaldata);
    System.out.println(majordata);
    System.out.println(minordata);
    try
    {
        File file1 = new File("WebContent/criticaldata.xml");
        File file2 = new File("WebContent/majordata.xml");
        File file3 = new File("WebContent/minordata.xml");
        File file4 = new File("WebContent/blockerdata.xml");
        FileOutputStream fop1 = new FileOutputStream(file1);
        FileOutputStream fop2 = new FileOutputStream(file2);
        FileOutputStream fop3 = new FileOutputStream(file3);
        FileOutputStream fop4 = new FileOutputStream(file4);
        // if the file doesn't exist, then create it
        if (!file1.exists()) {
            file1.createNewFile();
        }
        if (!file2.exists()) {
            file2.createNewFile();
        }
        if (!file3.exists()) {
            file3.createNewFile();
        }
        if (!file4.exists()) {
            file4.createNewFile();
        }
        // get the content in bytes
        byte[] contentInBytes1 = criticaldata.toString().getBytes();
        byte[] contentInBytes2 = majordata.toString().getBytes();
        byte[] contentInBytes3 = minordata.toString().getBytes();
        byte[] contentInBytes4 = blockerdata.toString().getBytes();
        // write each buffer to its own file
        fop1.write(contentInBytes1);
        fop2.write(contentInBytes2);
        fop3.write(contentInBytes3);
        fop4.write(contentInBytes4);
        fop1.flush();
        fop2.flush();
        fop3.flush();
        fop4.flush();
        fop1.close();
        fop2.close();
        fop3.close();
        fop4.close();
    }
    catch (IOException e)
    {
        // report the failure instead of silently ignoring it
        e.printStackTrace();
    }
%>
The problem is that the code doesn't seem to be working. I also tried using a PrintWriter, but the files are not being generated. I also want to overwrite a file if it already exists. Can somebody please help me with how to do this?
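
A minimal sketch of the "create or overwrite" part, using only the JDK. The class and method names are hypothetical, and it assumes the caller passes an absolute target directory: a relative path such as "WebContent/..." is resolved against the server process's working directory, which is often not the project folder. It also assumes UTF-8 is the wanted encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the directory is supplied by the caller.
final class XmlFileWriter {
    /** Creates the file if it is missing, replaces its content if it exists, and returns its path. */
    static Path writeXml(Path directory, String fileName, CharSequence content) throws IOException {
        Files.createDirectories(directory);
        Path target = directory.resolve(fileName);
        // With no options given, Files.write creates the file or truncates an existing one.
        return Files.write(target, content.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

Each StringBuilder would be passed in with its own file name, and any IOException should be logged rather than swallowed so that a wrong path shows up immediately.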

Read embedded MS Word file from Excel and save it to disk using Java

I have an Excel file with two MS Word files embedded in it.
I am using Apache POI to read the embedded objects from the Excel file in Java.
The problem is that when I read an embedded file, save it to disk, and open the saved file in MS Word, Word cannot read its format.
If I open it directly from the Excel file, it opens and Word reads it properly.
Can anyone help me?
[Code]
public class test {
    public static void main(String[] args) throws Exception {
        File file = new File("C:/Book2.xls");
        NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
        HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
        for (HSSFObjectData obj : wb.getAllEmbeddedObjects()) {
            String oleName = obj.getOLE2ClassName();
            DirectoryNode dn = (DirectoryNode) obj.getDirectory();
            Iterator<Entry> ab = dn.getEntries();
            if (oleName.contains("Document")) {
                HWPFDocument embeddedWordDocument = new HWPFDocument(dn);
                String docTitle = embeddedWordDocument.getSummaryInformation().getTitle();
                InputStream is;
                Entry entry = ab.next();
                is = dn.createDocumentInputStream(entry);
                FileOutputStream fos = new FileOutputStream("d:/" + docTitle + ".doc");
                System.out.println(is.available());
                System.out.println(((DocumentEntry) entry).getSize());
                IOUtils.copy(is, fos);
                fos.close();
                is.close();
            }
        }
        fs.close();
    }
}
