Manipulation of Word Document using OOXML? - java

We have a requirement to read a Word document and update it with dynamic data from our application, while other sections are edited by the user directly in Word.
Every time the user wants to fetch data from the application, they upload the document and merge it.
To be more specific, I looked into manipulating the Word document by adding meta tags as markers in the OOXML for each section, but I have not been able to find any. Is there an option to add meta tags for content that can act as template markers?
Note: We want to implement this in a Java application.

You can use a Java library, Apache POI (http://poi.apache.org/), to manipulate Word files, but for templating you need a separate system such as Velocity or FreeMarker. Or you can just use String replace.
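As a rough illustration of the plain String replace option, here is a minimal sketch that fills markers in text already extracted from the document. The ${name} marker syntax and the PlaceholderFill class are assumptions made for this example; they are not part of Apache POI, Velocity, or FreeMarker.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class PlaceholderFill {
    // Assumed marker syntax: ${name}, where name is letters, digits, '_' or '.'.
    private static final Pattern MARKER = Pattern.compile("\\$\\{([A-Za-z0-9_.]+)\\}");

    /** Replaces every known ${name} marker in text with its value from the map. */
    static String fill(String text, Map<String, String> values) {
        Matcher m = MARKER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Unknown markers are left untouched so they stay visible in the output.
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement stops '$' and '\' in the value from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This only works on text that contains the whole marker in one piece, so it is best applied to one paragraph's text at a time rather than to raw file bytes.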

Related

Extracting data from a PDF

I have a system that ultimately creates PDF files from an HTML file. It works much like a mail merge: it grabs data from a database, merges the data into placeholders in the HTML document, and then converts the HTML file to a PDF.
When I am unit testing the HTML file, I can look at the values in my placeholders. For example, if I had a div containing John Smith and I wanted to validate that the name is "John Smith", I would simply look at the value of the div after the merge.
I need to do something similar to validate the data in the PDF. Using pdfbox and itext I was able to extract text from a location as well as text from the whole document, but I can't find anything that would let me create a "tag/placeholder/..." and extract information from it the way I do with the HTML file.
Is this possible with a pdf?
That's perfectly possible using pdf2Data, which is a solution from the iText suite.
You can find the demo here: http://pdf2data.online/
It does essentially what you described: you are given a viewer and some tools that let you define areas of interest (what you called "placeholders").
Areas of interest can be defined using:
coordinates
relative to other areas of interest
relative to text or regular expressions
matching a certain regular expression
matching a table
etc
The tool then stores your template as an XML file, and you can use Java or .NET code to extract information from a PDF that matches the template.
The result comes back either as a JSON-like data structure or as an XML file.
That should make it relatively straightforward to test whether a given area of interest contains a piece of text.

Split docx to multiple docx using Java

I have a requirement to split one docx into multiple docx files based on subheadings.
The input document has a TOC, graphs, paragraphs, tables, images, and drawing objects.
I have to write an app that takes a docx and generates multiple docx files, one per subheading.
I could find a few resources for reading and writing paragraphs, but nothing for the other content types. Any suggestions on how to clone the document and write it out as is, so that the styles and formatting are preserved?
Thanks in advance
There are at least 2 ways to do this. The first is to use a clone of the entire document, but only including the relevant portion of the main document part. This is fairly easy to do, but the output documents might be large (since they contain unused images etc), unless you open/save in Word.
The second would be to use our commercial Docx4j Enterprise. You still have to identify where each chunk starts and finishes, but it will take just the objects referenced in that chunk (so you get small output documents).

read the contents from html and write in excel sheet in java

Please help me read an HTML file and write its contents to an Excel file using Java. I have searched the net, but I could only find examples for copying tables. I need the data in the file to be read and written to an Excel file.
Contents of the HTML file:
<title>**Deprecated Method Found**</title>
<h2>**Summary**</h2>
Deprecated API is error-prone and is a potential security threat and thus should not be used.
<h2>**Description**</h2>
Old API is sometimes marked deprecated because its implementation is designed in a way that can be error-prone. Deprecated API should be avoided where possible.
<h2>**Security Implications**</h2>
Blocks of code that use deprecated API are designed in a careless manner and thus are a potential security threat.
I need these headings in separate columns and their contents in rows.
Is it possible to parse this HTML into an Excel file?
Try reading the HTML using jsoup, a Java HTML parser. You can then save the result in CSV (comma-separated values) format, which can be opened in Excel.
See also the related question "jsoup to read text after element".
Get all h2 elements using
Elements h2s = document.select("h2");
Then iterate over the elements reading what the heading says. If this heading is important, use the following code to get the text following that tag.
String text = h2.nextSibling().toString();
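The CSV step can be sketched with plain Java once the headings and their text have been collected (for instance with the jsoup calls above). HeadingCsv and its method names are hypothetical, and the layout (headings as the header row, the text under each heading as a single data row) is an assumption based on the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; it does not use jsoup itself.
final class HeadingCsv {
    /** Quotes a field when it contains a comma, quote, or line break; embedded quotes are doubled. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    /** Builds a two-line CSV: headings in the first row, the text under each heading in the second. */
    static String toCsv(Map<String, String> sections) {
        List<String> header = new ArrayList<>();
        List<String> row = new ArrayList<>();
        for (Map.Entry<String, String> e : sections.entrySet()) {
            header.add(escape(e.getKey()));
            row.add(escape(e.getValue()));
        }
        return String.join(",", header) + "\r\n" + String.join(",", row) + "\r\n";
    }
}
```

Passing a LinkedHashMap keeps the columns in the order the headings appear in the HTML. The quoting matters here because the body text in the sample contains commas and may contain quotes.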

how to create a new word from template with docx4j

I have the following scenario, and need some advice:
The user will supply a Word document as a template and provide some parameters at runtime, so that I can query my database and get the data to fill the document.
So, there are two basic things I need to do:
Replace every key in the document with its respective result from the current query row.
"Merge" (copy? duplicate?) the existing document unchanged into itself (append) once for each row returned by the query, replacing the keys in each new copy with the next row's values.
What is the best approach to do this? I've managed to do the replace part for now by using unmarshallFromTemplate and passing it a HashMap.
But this way is a little tricky, because I need to add "${variable_name}" to the document, and sometimes Word splits "${" and "}" into different tags, causing issues.
I've read about custom XML binding, but didn't understand it completely. Do I need to generate a custom XML, inject it into the document (all of this at runtime), and call applyBindings? If so, how would I bind the fields in the document to the XML? By name?
docx4j includes VariablePrepare, which can tidy up your input docx so that your keys are not split across separate runs.
But, you would still be better off switching to content control data binding, particularly if you have repeated data (think for example of line items in an invoice). Disclosure: I champion this approach in docx4j.
To adopt the content control data binding approach:
1) Dream up an XML format which makes sense for your data, and write some code to convert the results of your database query into that format.
2) Modify your template so that the content controls are bound to elements in your XML document. Ordinarily you'd use an authoring add-in for Word to help with this. (The technology Microsoft uses for binding is XPath, so how you bind depends on your XML structure, but, yes, you'd typically bind to the element name or ID.)
3) Now that you have your XML file and a suitable input docx, ContentControlsMergeXML contains the code you need to create an instance document at run time. There's also a version of this for a servlet environment at https://github.com/plutext/OpenDoPE-WAR
As an alternative to 1 & 2, there is also org.docx4j.model.datastorage.migration.FromVariableReplacement in current nightlies, which can convert your existing "${" document. Only to a standardised target XML format though.
If you have further questions, there is a forum devoted to this topic at http://www.docx4java.org/forums/data-binding-java-f16/
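For the first step of the data binding approach above (converting query results into an XML format), a minimal sketch using only the JDK's DOM classes might look like this. The rows/row element names and the RowsToXml class are made up for illustration; docx4j does not prescribe them, and your own format can be whatever suits your data.

```java
import java.io.StringWriter;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration only; not part of docx4j.
final class RowsToXml {
    /**
     * Turns query rows (column name to value) into
     * <rows><row><col>value</col>...</row>...</rows>.
     * Column names must be valid XML element names.
     */
    static String toXml(List<Map<String, String>> rows) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("rows");
        doc.appendChild(root);
        for (Map<String, String> row : rows) {
            Element rowEl = doc.createElement("row");
            root.appendChild(rowEl);
            for (Map.Entry<String, String> col : row.entrySet()) {
                Element colEl = doc.createElement(col.getKey());
                colEl.setTextContent(col.getValue()); // escaped on serialization
                rowEl.appendChild(colEl);
            }
        }
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

With a structure like this, a content control for the first row's name column could be bound with an XPath along the lines of /rows/row[1]/name, and repeated data maps naturally onto the repeated row elements.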

Updating values of custom properties in word doc using java

I am not able to update the value of a custom property in a Word document using Java.
I have a Word document which contains a custom property with the value 'stack'. Using Java, I want to change that value to 'overflow'. I tried two approaches.
1) Using Apache POI
I set the org.apache.poi.hpsf.CustomProperties on org.apache.poi.hpsf.DocumentSummaryInformation and wrote it to the POIFSFileSystem.
This does update the value in Word's custom properties table, but it doesn't update the value shown for that property in the document body. After opening the document, I have to refresh it manually to get the value updated.
2) Parse the document char by char and use the field codes DC3, DC4 and NAK to identify the location of custom properties, then replace the existing value with the new value.
Now the generated document contains the new values. The problem is that if the old and new values differ in length, the document gets corrupted. I have checked my replacement logic carefully and believe it is correct.
Any help would be appreciated.
An update on the approach I took to solve the issue:
Using an Office 2007 (docx) document as the template made this easier. A docx is nothing but a zip archive. You can open it with your zip application (WinZip/7-Zip) and you will find many XML files inside it: document.xml contains the content, styles.xml contains formatting information, and so on.
At runtime, I unzipped the document, parsed document.xml with DOM, and updated it with the dynamic content. Custom properties are available in a separate XML file.
Word 2003 users have to prepare the template in the Word application, save the document as XML, and then provide that as input.
No use of apache-poi now.
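The unzip-and-DOM approach described above can be sketched with the JDK alone. This is an illustration under assumptions, not the code I actually used: DocxTextUpdater is a hypothetical name, it only touches word/document.xml, it replaces text only when the old value sits inside a single text element, and the output has not been validated against Word.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
final class DocxTextUpdater {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Copies a docx from in to out, replacing oldText with newText in word/document.xml. */
    static void update(InputStream in, OutputStream out, String oldText, String newText)
            throws Exception {
        try (ZipInputStream zin = new ZipInputStream(in);
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                byte[] data = readAll(zin);
                if (entry.getName().equals("word/document.xml")) {
                    data = rewrite(data, oldText, newText);
                }
                // All other parts (styles, images, properties) are copied unchanged.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(data);
                zout.closeEntry();
            }
        }
    }

    /** Parses the XML with a namespace-aware DOM and edits every w:t text element. */
    private static byte[] rewrite(byte[] xml, String oldText, String newText) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            String content = texts.item(i).getTextContent();
            texts.item(i).setTextContent(content.replace(oldText, newText));
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(buf));
        return buf.toByteArray();
    }

    /** Reads the current zip entry to its end. */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }
}
```

Because the text is changed through the DOM rather than by overwriting bytes in place, old and new values of different lengths are not a problem, which was what corrupted the document in the char-by-char approach.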
