I'm currently working with Excel files (*.xlsm) and Apache POI, and I have been cracking my head over a task.
I receive Excel files that have PDFs embedded in them, and I want to extract those PDFs and rename them based on the row and column they are in.
This may seem weird since, as I know, the embedded objects are represented as images, they can occupy more than one cell, and technically they are not "in" the cell.
The following code snippet lets me extract the embedded PDFs, but they are named OleObject1, OleObject2, etc., which doesn't give me any clue.
InputStream inStream = new FileInputStream(file);
XSSFWorkbook workbook = new XSSFWorkbook(inStream);
int i = 0; // simple counter used to name the extracted files
for (PackagePart pPart : workbook.getAllEmbedds()) {
    String contentType = pPart.getContentType();
    if (contentType.equals("application/vnd.openxmlformats-officedocument.oleObject")) {
        POIFSFileSystem fs = new POIFSFileSystem(pPart.getInputStream());
        TikaInputStream stream = TikaInputStream.get(fs.createDocumentInputStream("CONTENTS"));
        byte[] bytes = IOUtil.toByteArray(stream);
        stream.close();
        OutputStream outStream = new FileOutputStream(
                new File(ROOT_DIRECTORY.getAbsolutePath() + "\\PDF" + i++ + ".pdf"));
        IOUtil.copy(bytes, outStream);
        outStream.close();
    }
}
I wanted to know if org.openxmlformats.schemas.spreadsheetml.x2006.main.CTWorksheet will let me see the XML of the Excel sheet, and maybe with that I can get the info I need. Like this:
<oleObjects>
  <mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
    <mc:Choice Requires="x14">
      <oleObject progId="Acrobat Document" dvAspect="DVASPECT_ICON" shapeId="1028" r:id="rId4">
        <objectPr defaultSize="0" r:id="rId5">
          <anchor moveWithCells="1">
            <from>
              <xdr:col>8</xdr:col><xdr:colOff>0</xdr:colOff>
              <xdr:row>11</xdr:row><xdr:rowOff>0</xdr:rowOff>
            </from>
            <to>
              <xdr:col>8</xdr:col><xdr:colOff>1143000</xdr:colOff>
              <xdr:row>13</xdr:row><xdr:rowOff>171450</xdr:rowOff>
            </to>
          </anchor>
        </objectPr>
      </oleObject>
    </mc:Choice>
    <mc:Fallback>
      <oleObject progId="Acrobat Document" dvAspect="DVASPECT_ICON" shapeId="1028" r:id="rId4"/>
    </mc:Fallback>
  </mc:AlternateContent>
</oleObjects>
--
<objectPr defaultSize="0" r:id="rId5">
  <anchor moveWithCells="1">
    <from>
      <xdr:col>8</xdr:col><xdr:colOff>0</xdr:colOff>
      <xdr:row>11</xdr:row><xdr:rowOff>0</xdr:rowOff>
    </from>
    <to>
      <xdr:col>8</xdr:col><xdr:colOff>1143000</xdr:colOff>
      <xdr:row>13</xdr:row><xdr:rowOff>171450</xdr:rowOff>
    </to>
  </anchor>
</objectPr>
I guess using the anchor information should be possible, but I'm just unable to find out how to get it.
I hope this information makes it clear what I'm trying to do.
Thanks in advance.
I've looked at the source code for the current poi-ooxml-schemas sources JARs, which you can find here: http://repo1.maven.org/maven2/org/apache/poi/ooxml-schemas/1.3/
org.openxmlformats.schemas.spreadsheetml.x2006.main.CTWorksheet extends org.apache.xmlbeans.XmlObject which can give you the XML as a string using the inherited .toString() method. Or you can quickly access the list of OLE objects in the worksheet by calling getOleObjects() on your CTWorksheet object.
/**
* Gets the "oleObjects" element
*/
org.openxmlformats.schemas.spreadsheetml.x2006.main.CTOleObjects getOleObjects();
CTOleObjects itself extends org.apache.xmlbeans.XmlObject and again you can get the XML using toString() for parsing, or get a list of org.openxmlformats.schemas.spreadsheetml.x2006.main.CTOleObject OLE objects for iteration using CTOleObjects.getOleObjectList().
/**
* Gets a List of "oleObject" elements
*/
java.util.List<org.openxmlformats.schemas.spreadsheetml.x2006.main.CTOleObject> getOleObjectList();
CTOleObject doesn't seem to have getter methods for the objectPr and anchor child XML elements that would let you determine the columns, so I think you would need to do some XML parsing, or string searching, to get this info if it is contained in the string XML representation.
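If you want a concrete starting point, here is a rough sketch of that approach, under my own assumptions rather than an official API: I reach the sheet-level CTWorksheet via XSSFSheet.getCTWorksheet() and pull the first xdr:row/xdr:col values out of the raw XML with a regex, which relies on those tag names surviving toString(); the class name is made up.
import java.io.FileInputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTOleObject;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTOleObjects;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTWorksheet;

public class OleAnchorDump {

    // Returns the first <tag>...</tag> value found in the XML, or null if absent.
    private static String firstTagValue(String xml, String tag) {
        Matcher m = Pattern.compile("<" + tag + ">(\\d+)</" + tag + ">").matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        XSSFWorkbook workbook = new XSSFWorkbook(new FileInputStream(args[0]));
        for (int s = 0; s < workbook.getNumberOfSheets(); s++) {
            XSSFSheet sheet = workbook.getSheetAt(s);
            CTWorksheet ctSheet = sheet.getCTWorksheet();
            CTOleObjects oleObjects = ctSheet.getOleObjects();
            if (oleObjects == null) {
                continue; // this sheet has no embedded OLE objects
            }
            for (CTOleObject ole : oleObjects.getOleObjectList()) {
                String xml = ole.toString();                // raw XML of this oleObject element
                String row = firstTagValue(xml, "xdr:row"); // top-left anchor row (0-based)
                String col = firstTagValue(xml, "xdr:col"); // top-left anchor column (0-based)
                System.out.println(sheet.getSheetName() + ": row=" + row + ", col=" + col);
            }
        }
        workbook.close();
    }
}
From there you would still need to match each anchor back to its embedded part via the oleObject's r:id relationship; I've left that linkage out since it depends on how you iterate the package parts.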
Hope this helps.
Related
I am developing a font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. color, font, size of the text, hyperlinks, etc.).
I want to keep the formatting of the final docx the same as the original docx after converting the words from Unicode to the other font.
PFA.
Here is my code:
try {
    FileInputStream fileInputStream = new FileInputStream("StartDoc.docx");
    XWPFDocument document = new XWPFDocument(fileInputStream);
    Converter data = new Converter();

    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            if (string2 != null) {
                // assuming uniToShree returns the converted text
                String converted = data.uniToShree(string2);
                r.setText(converted, 0);
            }
        }
    }

    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Thanks for asking.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The Java runtime already gives you good debugging information on its own.
A good first step to disambiguate the error is to not overwrite the exception message that is handed to you.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace with printStackTrace() is also often useful to pinpoint where your error lies.
Share your findings from the above method calls to further help us debug the issue.
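For example, a catch block along these lines (a minimal sketch; the message text is only a placeholder) surfaces the real cause instead of hiding it:
catch (IOException e) {
    // Print the actual exception details instead of a generic message
    System.out.println("Error while processing the document: " + e.getMessage());
    e.printStackTrace();
}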
[EDIT 1:]
So it seems you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted file.
(Thus, "We had an error while reading the Word Doc" is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
In order to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file you want quite easily (see the steps in [1] or the code in link [2]; a rough sketch follows the links below) and even apply different schemas, or possibly your own schema definitions based on the definitions provided by MS's namespaces. You can do this programmatically, for which you need to get versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), although this method is no less complex than writing your own version of MS Word; or you can use MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
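For reference, here is the sketch mentioned above: roughly what the test in [2] does. Note that WordToHtmlConverter targets the binary .doc format (HWPF), not .docx, and the file name here is just a placeholder.
import java.io.FileInputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

HWPFDocument doc = new HWPFDocument(new FileInputStream("StartDoc.doc"));
WordToHtmlConverter converter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
converter.processDocument(doc);
Document htmlDocument = converter.getDocument();

// Serialize the DOM to an HTML string that you can inspect or transform further
StringWriter out = new StringWriter();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.transform(new DOMSource(htmlDocument), new StreamResult(out));
String html = out.toString();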
My answer provides you with a cursory overview of how to achieve what you want, but depending on your inclination and time availability, you may want to use your discretion before you decide to head down one path rather than the other.
Hope it helps!
I am reading a Word document, for example SampleOne.doc, and storing it in a byte[].
@Column(name = "LETTER_WORD_EDITOR_VALUE")
private byte[] letterWordEditorValue;
It is stored as a BLOB in the DB.
I want to read the contents of another Word document, for example SampleTwo.doc, as a byte[], append the two arrays and set the resulting byte[] into letterWordEditorValue.
Below is the code to do that.
FileInputStream fileRead = new FileInputStream(fileNameWithPath);
byte[] readData=IOUtils.toByteArray(fileRead);
byte[] one = readData;
byte[] two = inquiryCor.getLetterWordEditorValue();
byte[] combined = new byte[one.length + two.length];
System.arraycopy(one,0,combined,0,one.length);
System.arraycopy(two,0,combined,one.length,two.length);
inquiryCor.setLetterWordEditorValue(combined);
Below is the code that reads letterWordEditorValue and writes it into a Word file.
fileEditOutPutStream = new FileOutputStream(fileNameWithPath);
fileEditOutPutStream.write(inquiryCor.getLetterWordEditorValue());
fileEditOutPutStream.close();
The contents of the resulting Word file are not the contents of one + two; rather, it contains the readData value only. But when printing combined.length, the resulting length is the sum of one.length + two.length.
Why is the code above not appending the contents of the two Word documents?
What am I doing wrong? Please guide me to solve this issue.
Thanks!
It's not possible to combine two proprietary documents via simple byte-array concatenation. That wouldn't even make any sense. You need to parse the two documents with some library and put them together manually. What you were trying to do is like trying to use two motors inside one car by attaching a second car to the first one ... does not compute!
Apache offers a library for Office documents: https://poi.apache.org/
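If all you need is the second document's text appended to the end of the first, a rough text-only sketch with POI's HWPF could look like this (formatting, images and tables from the second file are lost; the file names and the inquiryCor object are the ones from your question):
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;

HWPFDocument first = new HWPFDocument(new FileInputStream("SampleOne.doc"));
HWPFDocument second = new HWPFDocument(new FileInputStream("SampleTwo.doc"));

// Pull the plain text out of the second document and append it to the first
String secondText = new WordExtractor(second).getText();
Range range = first.getRange();
range.insertAfter(secondText);

// Serialize the modified first document back to a byte[] for the BLOB column
ByteArrayOutputStream out = new ByteArrayOutputStream();
first.write(out);
inquiryCor.setLetterWordEditorValue(out.toByteArray());
Treat this only as a starting point: HWPF's insert support is limited (see the HWPF answers further down), so a real merge that keeps formatting is considerably more work.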
Replace
fileEditOutPutStream.write(inquiryCor.getLetterWordEditorValue());
with
fileEditOutPutStream.write(inquiryCorrespondence.getLetterWordEditorValue());
I have been trying to edit different types of documents using Apache POI. The script should handle both the .doc and .docx extensions. I could successfully edit the .docx file using the XWPF API, and the required text was added at the end of the docx file.
For editing .doc files (which include a header, a footer and a few paragraphs), the following script is used, which uses HWPFDocument.
FileInputStream fis = new FileInputStream(args[0]);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
CharacterRun run = range.insertAfter("FROM SEHWAGGG A FOUUURRRRRR");
run.setBold(true);
run.setItalic(true);
The script works fine with normal documents that do not have a header and footer. But the issue seems to appear with complex documents. It inserts the text, but in between the paragraphs (and at the beginning when using insertBefore()). No text replacement is required; I just have to put the text at the end of the document. I searched for similar scripts, but most of them handle text replacement.
How can I add the text at the end, after all paragraphs?
I've tested it with a sample document.
At first (with your original code) it completely destroyed the document.
By changing the following line, the insert works fine for me:
// Old
Range range = doc.getRange();
// New
Range range = doc.getEndnoteRange();
I'm afraid you are out of luck with HWPF in the current state of the project.
I created a custom HWPF library for one of our clients, but the changes are not public. The changes were huge, so you can't spend - say - a week and assume that things will be fixed. You might get away with the current public HWPF when only some text needs to be replaced without changing the string length ("abc" -> "123" or "a " -> "1234").
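As an illustration of the one case that tends to survive, here is a minimal same-length replacement sketch (the file names are placeholders):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;

HWPFDocument doc = new HWPFDocument(new FileInputStream("input.doc"));
Range range = doc.getRange();
// Replacement has the same number of characters, so the binary format's offsets stay consistent
range.replaceText("abc", "123");
doc.write(new FileOutputStream("output.doc"));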
I have a solution that inserts strings into an XHTML document and prints the results as Reports. My employer has asked if we could pull images off their SQL database (stored as byte arrays) to insert into the Reports.
I am using FlyingSaucer as the XHTML interpreter and I've been using Java DOM to modify pre-stored reports that I have stored in the Report Generator's package.
The only solution I can think of at the moment is to construct the images, save them as a file, link the file in an img tag (or background-image) in a constructed report, print the report and then delete the file. This seems really sloppy and I imagine it will be very time consuming.
I can't help but feel there must be a more elegant solution. Any suggestions for inserting a byte array into html?
Read the image and convert it into its Base64-encoded form:
InputStream image = getClass().getClassLoader().getResourceAsStream("image.png");
String encodedImage = BaseEncoding.base64().encode(ByteStreams.toByteArray(image));
I've used BaseEncoding and ByteStreams from Google Guava.
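If you would rather not add Guava, the JDK alone can do the same (java.util.Base64 requires Java 8+; in your database scenario you already have the byte[], so only the encoding line applies):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Read the image bytes and Base64-encode them for use in a data: URI
byte[] imageBytes = Files.readAllBytes(Paths.get("image.png"));
String encodedImage = Base64.getEncoder().encodeToString(imageBytes);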
Change the src attribute of the img element within your Document object.
Document doc = ...; // get Document from XHTMLPanel.getDocument() or create
// new one using DocumentBuilderFactory
doc.getElementById("myImage").getAttributes().getNamedItem("src").setNodeValue("data:image/png;base64," + encodedImage);
Unfortunately FlyingSaucer does not support Data URIs out of the box, so you'll have to create your own ReplacedElementFactory. Read the Using Data URLs for embedding images in Flying Saucer generated PDFs article - it contains a complete solution.
I am trying to pull data from Microsoft Word, translate it into a SQL statement and insert it into an Oracle database.
When the data in MS Word contains a new line that was created with [Shift-Enter] and not just Enter,
the extracted text contains an icon that looks like a box with a question mark.
Here ET marks a standard new line made with the Enter key and ST marks new lines made with the
Shift-Enter combination. So when generating the SQL and inserting it into Oracle, Oracle treats that not as text, but as hex.
My question is: how do I convert the lines created by [Shift-Enter] into a standard '\n'?
Thanks
Update
This is how I get the text information:
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(file));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
text = we.getText();
Update Answer:
This was a bug in poi-3.6. In poi-3.8 it shows as \r.
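Either way, it may be safest to normalize whatever break characters the extractor emits before building the SQL statement. A small sketch; the 0x0B soft line break is an assumption about how some versions surface Shift-Enter:
// Collapse Windows line endings, carriage returns and soft line breaks to '\n'
String normalized = text.replace("\r\n", "\n")
                        .replace('\r', '\n')
                        .replace((char) 0x0B, '\n');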
What you're almost certainly seeing are "fields" in the Word document, which are special blocks of text such as links, macros, etc.
Option number one is to continue using WordExtractor, but call stripFields(String) on the resulting text before using it. That'll remove any of these fields from the text for you.
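A minimal sketch of that first option, building on the extraction code above:
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(file));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
// stripFields(String) removes field codes such as links and macros from the extracted text
String text = WordExtractor.stripFields(we.getText());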
The other option is to use a different way of getting the text out. WordToTextConverter is part of Apache POI; it is more complex code that handles more of the format and should skip these for you (WordExtractor is pretty simple and low level). Another option is to use Apache Tika, which provides a common way of extracting text from a number of file formats. That does have the proper code to deal with fields, and as an added bonus it'll be trivial for you to support .docx or .pdf when your requirements change!
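And a sketch of the Tika route (assumes tika-core and tika-parsers are on the classpath; the file name is a placeholder):
import java.io.File;
import org.apache.tika.Tika;

Tika tika = new Tika();
// Tika detects the format and extracts plain text; works for .doc, .docx, .pdf and more
String text = tika.parseToString(new File("input.doc"));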