Converting part of a .docx document to HTML using Apache POI

I use XHTMLConverter to convert a .docx file to HTML in order to show a preview of the document. Is there any way to convert only a few pages of the original document? I'd be grateful for any help.

You have to parse the complete .docx file; it is not possible to read just part of it. As for selecting a specific page, I'm afraid that (at least as far as I know) Word does not store page numbers in the file, so the library has no function to access a given page.
(I read this on another forum, so it may be wrong.)
PS: POI's Excel API has a getSheetAt() method, which might help with your research.
There are other ways to locate your pages, though. For instance, you could read the paragraphs of the document and search for the page numbers (this breaks if the body text happens to contain the same numbers). A more accurate approach would be to search for the header of each page. Note that HeaderStories belongs to HWPF, so it works on .doc files (HWPFDocument), not on .docx:
HeaderStories headerStore = new HeaderStories( doc);
String header = headerStore.getHeader(pageNumber);
This should give you the header of the specified page. The same works for the footer:
HeaderStories headerStore = new HeaderStories( doc);
String footer = headerStore.getFooter(pageNumber);
If this doesn't work, I'm sorry; I'm not very familiar with that API.
Here is a small example of a very rough solution:
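import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile
{
    public static void main(String[] args)
    {
        int firstLineOfPageOne = -1; // paragraph index where page one starts
        int lastLineOfPageOne = -1;  // paragraph index where page two starts
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream("c:\\New.doc"))
        {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++)
            {
                // paragraph text usually ends with a line terminator, so trim before comparing
                String paragraph = fileData[i].trim();
                if (paragraph.equals("headerPageOne")) {
                    firstLineOfPageOne = i;
                }
                if (paragraph.equals("headerPageTwo")) {
                    lastLineOfPageOne = i;
                }
            }
            System.out.println("Page one: paragraphs " + firstLineOfPageOne + " to " + lastLineOfPageOne);
        }
        catch (IOException exep)
        {
            exep.printStackTrace();
        }
    }
}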
If you go with this, I would recommend putting your headers in a list and refactoring the for-loop into a separate getPages() method. The loop would then look like this:
List<String> headers = new ArrayList<String>(Arrays.asList("header1", "header2", "header3", "header4"));
for (int i = 0; i < fileData.length; i++)
{
    // there should also be an outer loop over "x";
    // pageStart and pageStop are declared outside this loop
    if (fileData[i].trim().equals(headers.get(x))) {
        pageStart = i;
    }
    if (fileData[i].trim().equals(headers.get(x + 1))) {
        pageStop = i;
    }
}
You could create an object holding (int pageStart, int pageStop), which would be the result of your method.
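As a rough sketch of that getPages() idea, the version below works on a plain String[] so it can be tried without POI. The names PageLocator and PageRange are made up for illustration, and it assumes each page header appears as its own paragraph, in page order; a real document may not satisfy that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: locates page boundaries by searching for known header lines. */
final class PageLocator {

    /** Half-open paragraph range [pageStart, pageStop) belonging to one page. */
    static final class PageRange {
        final int pageStart;
        final int pageStop;

        PageRange(int pageStart, int pageStop) {
            this.pageStart = pageStart;
            this.pageStop = pageStop;
        }

        @Override
        public String toString() {
            return "[" + pageStart + ", " + pageStop + ")";
        }
    }

    /**
     * Returns one range per header that was found, in header order. A page starts
     * at the paragraph equal to its header and stops at the next header found
     * (or at the end of the text). Headers that never occur are skipped.
     */
    static List<PageRange> getPages(String[] paragraphs, List<String> headers) {
        List<Integer> starts = new ArrayList<>();
        int searchFrom = 0;
        for (String header : headers) {
            for (int i = searchFrom; i < paragraphs.length; i++) {
                // trim() drops the line terminator that paragraph text may carry
                if (paragraphs[i].trim().equals(header)) {
                    starts.add(i);
                    searchFrom = i + 1;
                    break;
                }
            }
        }
        List<PageRange> pages = new ArrayList<>();
        for (int p = 0; p < starts.size(); p++) {
            int stop = (p + 1 < starts.size()) ? starts.get(p + 1) : paragraphs.length;
            pages.add(new PageRange(starts.get(p), stop));
        }
        return pages;
    }

    public static void main(String[] args) {
        String[] paragraphs = {"header1\r\n", "first page text\r\n", "header2\r\n", "second page text\r\n"};
        // prints [[0, 2), [2, 4)]
        System.out.println(getPages(paragraphs, Arrays.asList("header1", "header2")));
    }
}
```

With the ranges in hand, you would pass only the paragraphs in one range to whatever produces the preview.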
I hope this helps :)

Related

Get size (in bytes) of a specific page in a PDF using iText

I'm using iText (v 2.1.7) and I need to find the size, in bytes, of a specific page.
I've written the following code:
public static long[] getPageSizes(byte[] input) throws IOException {
PdfReader reader;
reader = new PdfReader(input);
int pageCount = reader.getNumberOfPages();
long[] pageSizes = new long[pageCount];
for (int i = 0; i < pageCount; i++) {
pageSizes[i] = reader.getPageContent(i+1).length;
}
reader.close();
return pageSizes;
}
But it doesn't work properly. The reader.getPageContent(i+1).length; instruction returns very small values (<= 100 usually), even for large pages that are more than 1MB, so clearly this is not the correct way to do this.
But what IS the correct way? Is there one?
Note: I've already checked this question, but the offered solution consists of writing each page of the PDF to disk and then checking the file size, which is extremely inefficient and may even be wrong, since I'm assuming this would repeat the PDF header and metadata each time. I was searching for a more "proper" solution.

Well, in the end I managed to get hold of the source code for the original program that I was working with, which only accepted PDFs as input with a maximum "page size" of 1MB. Turns out... what it actually means by "page size" was fileSize / pageCount -_-^
For anyone who actually needs the precise size of a "standalone" page, with all content included, I've tested this solution and it seems to work well, though it probably isn't very efficient, as it writes out an entire PDF document for each page. Using a memory stream instead of a disk-based one helps, but I don't know how much.
public static int[] getPageSizes(byte[] input) throws IOException {
PdfReader reader;
reader = new PdfReader(input);
int pageCount = reader.getNumberOfPages();
int[] pageSizes = new int[pageCount];
for (int i = 0; i < pageCount; i++) {
try {
Document doc = new Document();
ByteArrayOutputStream bous = new ByteArrayOutputStream();
PdfCopy copy= new PdfCopy(doc, bous);
doc.open();
PdfImportedPage page = copy.getImportedPage(reader, i+1);
copy.addPage(page);
doc.close();
pageSizes[i] = bous.size();
} catch (DocumentException e) {
e.printStackTrace();
}
}
reader.close();
return pageSizes;
}
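For comparison, the simpler "page size" that the original program turned out to use (file size divided by page count) needs no PDF library at all. This is only a sketch: the class and method names are made up, and treating 1MB as 1024 * 1024 bytes is an assumption.

```java
/** Hypothetical helper illustrating the "fileSize / pageCount" rule described above. */
final class AveragePageSize {

    /** Returns fileSizeBytes / pageCount, rounded down. */
    static long averagePageSize(long fileSizeBytes, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        return fileSizeBytes / pageCount;
    }

    /** True when the average page size does not exceed the given limit. */
    static boolean withinLimit(long fileSizeBytes, int pageCount, long maxBytesPerPage) {
        return averagePageSize(fileSizeBytes, pageCount) <= maxBytesPerPage;
    }

    public static void main(String[] args) {
        long oneMegabyte = 1024L * 1024L; // assumed meaning of the 1MB limit
        // a 5,000,000-byte file with 10 pages averages 500,000 bytes per page
        System.out.println(withinLimit(5_000_000L, 10, oneMegabyte)); // prints true
    }
}
```

In the code from this thread, the two inputs would be input.length and reader.getNumberOfPages().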

Problems with scanning

I'm studying Biomedical Informatics and am currently doing my clinical practice, where I have to check that the charges made to hospitalized patients were applied correctly for supplies that may only be charged once (every procedure and supply used has a code).
I can import the Excel file into the software I'm writing, but I don't know how to do the scan.
Here is the code (I'm using NetBeans):
public class Portal extends javax.swing.JFrame {
private DefaultTableModel model;
public static int con = 0;
public ArrayList listas = new ArrayList();
public ArrayList listasr = new ArrayList();
public Portal() {
initComponents();
model = new DefaultTableModel();
jTable1.setModel(model);
}
private void jButton1ActionPerformed(java.awt.event.ActionEvent evt) {
JFileChooser examinar = new JFileChooser();
examinar.setFileFilter(new FileNameExtensionFilter("Archivos Excel", "xls", "xlsx"));
int opcion = examinar.showOpenDialog(this);
File archivoExcel = null;
if(opcion == JFileChooser.APPROVE_OPTION){
archivoExcel = examinar.getSelectedFile().getAbsoluteFile();
try{
Workbook leerExcel = Workbook.getWorkbook(archivoExcel);
for (int hoja=0; hoja<leerExcel.getNumberOfSheets(); hoja++)
{
Sheet hojaP = leerExcel.getSheet(hoja);
int columnas = hojaP.getColumns();
int filas = hojaP.getRows();
Object data[]= new Object[columnas];
for (int fila=0; fila < filas; fila++)
{
for(int columna=0; columna < columnas; columna++)
{
if(fila==0)
{
model.addColumn(hojaP.getCell(columna, fila).getContents());
}
System.out.println(hojaP.getCell(columna, fila).getContents());
if(fila>=1)
data[columna] = hojaP.getCell(columna, fila).getContents();
}model.addRow(data);
}
}
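model.removeRow(0);
JOptionPane.showMessageDialog(null, "Excel cargado exitosamente");
} catch (Exception e) {
// reading the workbook can fail on unreadable or unsupported files
JOptionPane.showMessageDialog(null, e.getMessage());
}
}
}
}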

Before you import the Excel file, save it as a CSV (comma delimited) file (remember to delete the headings). Then open the NetBeans project folder under My Documents, open your project folder, and put the CSV file in there. If you look at your project under Files in NetBeans and open the folder, you will see the file there. Now, you said you want to read (scan) the file.
You can start with my method, understand it, and adapt it to other scenarios you meet in the future.
First create a class, or use one you have already created.
Declare the arrays, sized according to how many rows you had in the Excel file (not the CSV file), and a counter.
For example, two arrays:
String [] patientsnames;
int [] ages;
int count;
Now initialize the arrays in a default constructor (you don't have to, since you can also do it where you declare them, but it is conventional). You can read up on constructors; I will only show a default constructor here.
It will look like this:
public yourClassName(){
patientsnames = new String[400]; // 400 is only an example array size. Set it to the number of patients, or use a list instead, whose size is limited only by available memory.
ages = new int[400];
count = 0;
}
Now create the method to read the text file.
public void readFile(){
count = 0;//important
Scanner contents = null;
try{
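contents = new Scanner(new FileReader("Your file's name.txt"));
while(contents.hasNextLine()){
String a = contents.nextLine();
String p[]= a.split(";"); // use "," if your CSV is comma-separated
patientsnames[count] = p[0];
ages[count] = Integer.parseInt(p[1].trim()); // ages is an int[], so the text must be converted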
count++;//important
}
}
catch(FileNotFoundException e){
System.out.println(e.getMessage());
}
}
Now create getter methods that return the arrays with the values from the file (there are examples elsewhere on Stack Overflow).
Remember that the field types must match the data in the file.
I really hope this works for you. If not, I am sorry, but good luck with your Biomedical Informatics course.
Remember to call the readFile method on an object of the class, or it won't work.
Look up the necessary imports, such as:
import java.io.*;
import java.util.*;
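The list-based alternative mentioned above could look roughly like this. It is a sketch under assumptions: every line has the form "name;age", and the names PatientCsvReader and Patient are invented for the example. It takes a Reader so it can be tried on a string before pointing it at a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical reader for lines of the form "name;age". */
final class PatientCsvReader {

    /** One row of the file. */
    static final class Patient {
        final String name;
        final int age;

        Patient(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    /** Reads all well-formed rows; blank or malformed rows are skipped. */
    static List<Patient> read(Reader source, String delimiter) throws IOException {
        List<Patient> patients = new ArrayList<>(); // grows as needed, no fixed size
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Pattern.quote makes the delimiter literal, since split() takes a regex
                String[] fields = line.split(Pattern.quote(delimiter));
                if (fields.length < 2) {
                    continue; // not enough columns
                }
                try {
                    patients.add(new Patient(fields[0].trim(), Integer.parseInt(fields[1].trim())));
                } catch (NumberFormatException e) {
                    // age column is not a number: skip the row
                }
            }
        }
        return patients;
    }

    public static void main(String[] args) throws IOException {
        List<Patient> patients = read(new StringReader("Ana;34\nLuis;58\n"), ";");
        for (Patient p : patients) {
            System.out.println(p.name + " is " + p.age);
        }
    }
}
```

To read an actual file, pass new FileReader("yourfile.csv") as the source, and use "," as the delimiter if that is what your export contains.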

Apache-POI formatted text issue in .doc file

I am using this example in my project. Everything works, and the text replacement is fine too, but text in the output file that should be centered has become left-aligned. The input file is a .doc. I have a feeling that something is breaking the document's formatting, but I'm not sure what the problem is. How can I resolve this?
public class HWPFTest {
public static void main(String[] args){
String filePath = "F:\\Sample.doc";
POIFSFileSystem fs = null;
try {
fs = new POIFSFileSystem(new FileInputStream(filePath));
HWPFDocument doc = new HWPFDocument(fs);
doc = replaceText(doc, "$VAR", "MyValue1");
saveWord(filePath, doc);
}
catch(FileNotFoundException e){
e.printStackTrace();
}
catch(IOException e){
e.printStackTrace();
}
}
private static HWPFDocument replaceText(HWPFDocument doc, String findText, String replaceText){
Range r1 = doc.getRange();
for (int i = 0; i < r1.numSections(); ++i ) {
Section s = r1.getSection(i);
for (int x = 0; x < s.numParagraphs(); x++) {
Paragraph p = s.getParagraph(x);
for (int z = 0; z < p.numCharacterRuns(); z++) {
CharacterRun run = p.getCharacterRun(z);
String text = run.text();
if(text.contains(findText)) {
run.replaceText(findText, replaceText);
}
}
}
}
return doc;
}
private static void saveWord(String filePath, HWPFDocument doc) throws FileNotFoundException, IOException{
// try-with-resources closes the stream and avoids a NullPointerException if opening the file fails
try (FileOutputStream out = new FileOutputStream(filePath)) {
doc.write(out);
}
}
}

HWPF is not usable for writing .doc files. It may work for very simple file content, but little extras break it. I fear you are out of luck here - if it is an option for you, you might want to use RTF files and work on those. Word should work correctly, if you rename the rtf extension to .doc (if you need the .doc extension).
(I developed a custom and working variant of HWPF for a client and know how difficult things can get there. The standard HWPF library will get in trouble when characters outside the 8-bit encoding exist, when tables are used, when text boxes are used, when graphics are embedded, ... . Some things in .doc files are also different from what is described in the official specification from Microsoft. Making a working HWPF library is "non-trivial" and requires a lot of spec-reading and investigation. If you want to go down the road to fix those bugs, you need to expect at least half a year of development effort.)

How to use Apache HWPF to extract text and images out of a DOC file

I downloaded Apache HWPF. I want to use it to read a .doc file and write its text into a plain text file. I don't know HWPF very well.
My very simple program is here:
I have 3 problems now:
1. Some of the packages have errors (they can't find apache hdf). How can I fix them?
2. How can I use the methods of HWPF to find and extract the images?
3. Part of my program is incomplete and incorrect, so please help me complete it.
I have to finish this program in 2 days.
Once again, please help me complete this.
Thank you all very much for your help!
This is my elementary code:
public class test {
public void m1 (){
String filesname = "Hello.doc";
POIFSFileSystem fs = null;
fs = new POIFSFileSystem(new FileInputStream(filesname));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
String str = we.getText() ;
String[] paragraphs = we.getParagraphText();
Picture pic = new Picture(. . .) ;
pic.writeImageContent( . . . ) ;
PicturesTable picTable = new PicturesTable( . . . ) ;
if ( picTable.hasPicture( . . . ) ){
picTable.extractPicture(..., ...);
picTable.getAllPictures() ;
}
}
}

Apache Tika will do this for you. It handles talking to POI to do the HWPF stuff, and presents you with either XHTML or Plain Text for the contents of the file. If you register a recursing parser, then you'll also get all the embedded images too.

//you can use the org.apache.poi.hwpf.extractor.WordExtractor to get the text
String fileName = "example.doc";
HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
WordExtractor extractor = new WordExtractor(wordDoc);
String[] text = extractor.getParagraphText();
int lineCounter = text.length;
StringBuilder articleStr = new StringBuilder(); // collects the text from the Word document
for(int index = 0;index < lineCounter;++ index){
String paragraphStr = text[index].replaceAll("\r\n","").replaceAll("\n","").trim();
int paragraphLength = paragraphStr.length();
if(paragraphLength != 0){
articleStr.append(paragraphStr); // String.concat returns a new String, so discarding its result would keep nothing
}
}
//you can use the org.apache.poi.hwpf.usermodel.Picture to get the image
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
for(int i = 0;i < picturesList.size();++i){
BufferedImage image = null;
Picture pic = picturesList.get(i);
image = ImageIO.read(new ByteArrayInputStream(pic.getContent()));
if(image != null){
System.out.println("Image["+i+"]"+" ImageWidth:"+image.getWidth()+" ImageHeight:"+image.getHeight()+" Suggest Image Format:"+pic.suggestFileExtension());
}
}
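The text-collecting loop above can be tried without POI by feeding it a plain String[]. In this sketch, ParagraphJoiner and joinNonEmpty are made-up names, and the separator parameter is an addition that is not in the answer.

```java
/** Hypothetical helper: joins the non-empty paragraphs of a document into one string. */
final class ParagraphJoiner {

    static String joinNonEmpty(String[] paragraphs, String separator) {
        StringBuilder article = new StringBuilder();
        for (String paragraph : paragraphs) {
            // drop line terminators and surrounding whitespace
            String cleaned = paragraph.replace("\r", "").replace("\n", "").trim();
            if (cleaned.isEmpty()) {
                continue;
            }
            if (article.length() > 0) {
                article.append(separator);
            }
            // append mutates the builder; String.concat would return a new String instead
            article.append(cleaned);
        }
        return article.toString();
    }

    public static void main(String[] args) {
        String[] paragraphs = {"Hello\r\n", "  \r\n", "world\n"};
        System.out.println(joinNonEmpty(paragraphs, " ")); // prints Hello world
    }
}
```

With POI, the array would be the result of extractor.getParagraphText().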

If you just want to do this, and you don't care about the coding, you can just use Antiword.
$ antiword file.doc > out.txt

I know this is long after the fact, but I've found TextMining on Google Code to be more accurate and very easy to use. It is, however, pretty much abandoned code.

Open Microsoft Word in Java

I'm trying to open an MS Word 2003 document in Java, search for a specified String, and replace it with a new String. I use Apache POI to do that. My code looks like this:
public void searchAndReplace(String inputFilename, String outputFilename,
HashMap<String, String> replacements) {
File outputFile = null;
File inputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
Set<String> keySet = null;
Iterator<String> keySetIterator = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
String key = null;
String value = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
docRange = document.getRange();
numParagraphs = docRange.numParagraphs();
keySet = replacements.keySet();
for (int i = 0; i < numParagraphs; i++) {
paragraph = docRange.getParagraph(i);
text = paragraph.text();
numCharRuns = paragraph.numCharacterRuns();
for (int j = 0; j < numCharRuns; j++) {
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
System.out.println("Character Run text: " + text);
keySetIterator = keySet.iterator();
while (keySetIterator.hasNext()) {
key = keySetIterator.next();
if (text.contains(key)) {
value = replacements.get(key);
charRun.replaceText(key, value);
docRange = document.getRange();
paragraph = docRange.getParagraph(i);
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
}
}
}
}
bufIStream.close();
bufIStream = null;
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
} catch (Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
}
I call this function with following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line such as "AAA EEE", it works. With a more complicated file, it reads the content and generates Test1.doc, but when I try to open that file, I get the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do; I'm a beginner with POI and have not found a good tutorial for it.

First of all you should be closing your document.
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .XML to .doc . Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
Unfortunately, there is not much documentation for POI, especially for Word documents.
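On the first point: in the question's code the BufferedOutputStream is never closed, so bytes still sitting in its buffer may never reach the file. A minimal sketch of the usual remedy, try-with-resources, is shown below using plain JDK streams; StreamClosing and copy are illustrative names, and with POI the body of the loop would be replaced by document.write(out).

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical example of closing (and therefore flushing) buffered streams reliably. */
final class StreamClosing {

    /** Copies input to output and returns the number of bytes written. */
    static long copy(Path input, Path output) throws IOException {
        long total = 0;
        // Both streams are closed when the block exits, even on an exception.
        // Closing the BufferedOutputStream also flushes whatever is left in its buffer.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(input));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(output))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("source", ".bin");
        Path target = Files.createTempFile("target", ".bin");
        Files.write(source, new byte[] {1, 2, 3});
        System.out.println(copy(source, target) + " bytes copied");
    }
}
```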

I don't know whether it's OK to answer my own question, but to share the knowledge, I will.
After searching the web, the solution I finally settled on is this:
The docx4j library is very good for working with MS .docx files. Its documentation is still thin and its forum is just getting started, but overall it helped me do what I needed.
Thanks to everyone who helped me.

You could try the OpenOffice API, but there aren't many resources out there that tell you how to use it.

You can also try this one: http://www.dancrintea.ro/doc-to-pdf/

Looks like this could be the issue.
