I have the following scenario, and need some advice:
The user will supply a Word document as a template and provide some parameters at runtime, so I can query my database and get data to fill the document.
So, there are two basic things I need to do:
Replace every key in the document with its respective result from the current query row.
"Merge" (copy? duplicate?) the existing document unchanged into itself (append) depending on how many rows I got from the query, replacing the keys in this new copy with the next row's values.
What is the best approach to do this? I've managed to do the replace part for now, by using unmarshallFromTemplate and providing it a HashMap.
But this way is a little bit tricky, because I need to add "${variable_name}" in the document, and sometimes Word separates "${" and "}" into different tags, causing issues.
I've read about custom XML binding, but didn't understand it completely. Do I need to generate a custom XML, inject it into the document (all of this at runtime) and call applyBindings? If so, how would I bind the fields in the document to the XML? By name?
docx4j includes VariablePrepare, which can tidy up your input docx so that your keys are not split across separate runs.
But, you would still be better off switching to content control data binding, particularly if you have repeated data (think for example of line items in an invoice). Disclosure: I champion this approach in docx4j.
To adopt the content control data binding approach:
1. Dream up an XML format which makes sense for your data, and write some code to convert the results of your database query into that format.
2. Modify your template so that the content controls are bound to elements in your XML document. Ordinarily you'd use an authoring add-in for Word to help with this. (The technology Microsoft uses for binding is XPath, so how you bind depends on your XML structure, but, yes, you'd typically bind to the element name or ID.)
3. Now that you have your XML file and a suitable input docx, ContentControlsMergeXML contains the code you need to create an instance document at run time. There's also a version of this for a servlet environment at https://github.com/plutext/OpenDoPE-WAR
As an alternative to 1 & 2, there is also org.docx4j.model.datastorage.migration.FromVariableReplacement in current nightlies, which can convert your existing "${" document. Only to a standardised target XML format though.
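The first step (turning query rows into an XML document of your own design) can be sketched with the JDK alone. This is only an illustration, not docx4j code: the class name RowsToXml and the element names rows and row are made up, and it assumes every column name is a valid XML element name.

```java
import java.io.StringWriter;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class RowsToXml {

    /** Builds <rows><row><column>value</column>...</row>...</rows> from query rows. */
    static String toXml(List<Map<String, String>> rows) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("rows"); // hypothetical root element name
            doc.appendChild(root);
            for (Map<String, String> row : rows) {
                Element rowEl = doc.createElement("row"); // one element per result row
                root.appendChild(rowEl);
                for (Map.Entry<String, String> col : row.entrySet()) {
                    // Assumption: the column name is usable as an XML element name.
                    Element colEl = doc.createElement(col.getKey());
                    colEl.setTextContent(col.getValue()); // escaped on serialization
                    rowEl.appendChild(colEl);
                }
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }
}
```

With a structure like this, a content control for the first row's customer column could be bound with an XPath such as /rows/row[1]/customer, and a repeating section would point at /rows/row.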
If you have further questions, there is a forum devoted to this topic at http://www.docx4java.org/forums/data-binding-java-f16/
Related
Scenario:
I'm doing some HTML information extraction using a crawler. Right now, most of the rules for extraction are hardcoded (not the tags or things like that, but loops, nested elements, etc.)
For instance, one common task is as follows:
Obtain the table with ID X. If it doesn't exist, additional mechanisms to find the info may be triggered.
Find a row which contains some info. Usually the match is a regexp against a specific column.
Retrieve the data in a different column (usually marked in the td, or previously detected in the header)
The way I'm currently doing so is:
Query to get the body of first table with id X (X is in config file). Some websites of my list are buggy and duplicate that id on elements different than table -.-
Iterate over interesting cells, executing regexp on cell.text() (regexp is in config file)
Get the parent row of the matching cells, and obtain the cell I need from the row (identifier of the row is in config file)
Having all this hardcoded for the most part (except column names, table ids, etc.) gives me the benefit of being easy to implement and more efficient than a generic parser. However, it is less configurable, and some changes in the target websites force me to deal with code, which makes it harder to delegate the task.
Question
Is there any language (preferably with a Java implementation available) which allows extraction rules like those to be defined consistently? I'm using CSS-style selectors for some tasks, but others are not so simple, so my best guess is that there must be something extending them that would let a non-programmer maintainer add or modify rules on demand.
I would accept a Nutch-based answer, if there's one, as we're studying migrating our crawlers to nutch, although, I'd prefer a generic java solution.
I was thinking about writing a parser generator and creating my own set of rules to allow users/maintainers to generate parsers, but it really feels like reinventing the wheel for no reason.
I'm doing something somewhat similar - not exactly what you're searching for, but maybe you can get some ideas.
First the crawling part:
I'm using Scrapy on Python 3.7.
For my project, the advantage is that it's a very flexible crawling framework that is easy to build upon. Things like delays between requests, the HTTP header language, etc. can mostly be configured.
For the information extraction part and rules:
In my last generation of crawler (I'm now working on the 3rd gen; the 2nd one is still running but is not as scalable) I used JSON files to hold the XPath / CSS rules for every page. On starting my crawler, I loaded the JSON file for the specific page being crawled, and a generic crawler knew what to extract based on the loaded JSON file.
This approach isn't easily scalable since one config file per domain has to be created.
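For the Java side of the question, a rule of that kind (an XPath for the rows, a regexp for one column, and the number of the column to return, all of which could live in a config file) might be applied roughly as follows. This is a minimal sketch using only the JDK, not my Scrapy code: TableRule and its fields are hypothetical names, and it assumes the page has already been cleaned into well-formed XML without namespaces or a DOCTYPE.

```java
import java.io.StringReader;
import java.util.Optional;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TableRule {
    private final String rowsXPath;  // e.g. "//table[@id='prices']//tr"
    private final int matchColumn;   // 1-based column tested against the regexp
    private final Pattern pattern;   // regexp taken from the config
    private final int valueColumn;   // 1-based column whose text is returned

    TableRule(String rowsXPath, int matchColumn, String regex, int valueColumn) {
        this.rowsXPath = rowsXPath;
        this.matchColumn = matchColumn;
        this.pattern = Pattern.compile(regex);
        this.valueColumn = valueColumn;
    }

    /** Returns the value cell of the first row whose match cell fits the regexp. */
    Optional<String> apply(String wellFormedPage) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xp.evaluate(rowsXPath,
                    new InputSource(new StringReader(wellFormedPage)),
                    XPathConstants.NODESET);
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                // Header rows made of th cells yield an empty string here.
                String key = xp.evaluate("td[" + matchColumn + "]", row).trim();
                if (pattern.matcher(key).find()) {
                    return Optional.of(xp.evaluate("td[" + valueColumn + "]", row).trim());
                }
            }
            return Optional.empty();
        } catch (Exception e) {
            throw new IllegalStateException("Rule could not be applied", e);
        }
    }
}
```

A rule would then be created from config values, for instance new TableRule("//table[@id='prices']//tr", 1, "^Pear", 2), and applied to each cleaned page. Fallback mechanisms for a missing table could be modelled as a list of such rules tried in order.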
Currently, I'm still using Scrapy, with a starting list of 700 Domains to crawl and the crawler is now only responsible for downloading the whole website as HTML files.
These are being stored in tar archives by a shell script.
Afterward, a Python script goes through all members of the tar archives and analyzes the content for the information I'm looking to extract.
Here, as you said, it's a bit like re-inventing the wheel or writing a wrapper around an existing library.
In Python, one can use BeautifulSoup for removing all tags like script and style etc.
Then you can extract for instance all text.
Or you'd focus first on tables only, extract all tables into dicts and can then analyze with regex or similar.
There are libraries like DragNet for boilerplate removal.
And there are some specific approaches on how to extract table structured information.
I am preparing to embark on a large solo project at my place of employment. First let me describe the project. I have been asked to create a Java program that can take a CamT54 file (which is just an XML file) and display the information in table form. Users should then be able to remove certain components from the table and have it converted back to XML format with the changes.
I'm not well versed in dealing with XML in Java so this is going to be a learn and work task. Before I begin investing time I would like to know that my approach is the best approach.
My plan is to use DOM4J to do the parsing and handling of the xml. I will use a JTable to display the data and incorporate some buttons to the GUI that allow the modifications of the data through the use of some action listeners.
Would this be a plausible plan? Can DOM4J effectively allow xml data to be displayed in a table format and furthermore could that data be easily modified or deleted then resaved to a new xml?
I thought I would go ahead and answer this as I finished the program and wanted to post what I thought was the easiest solution in case anyone else needed help.
It turned out the easiest approach (for me at least) was to use the standard DOM parser. Here are the steps I took.
Parsed the entire XML into String array lists. XPath was required for this. I also had to convert the elements into Strings and remove the extra tag information from each string using substrings, since I only wanted the actual value.
I populated a JTable with these arrays.
Once users finished editing and clicked a save button, another DOM parser would take the original XML and change each and every value using the values from the arrays (which were cleared and repopulated with the JTable cell values when the user clicked "save").
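The read and write-back steps can be sketched with the JDK's DOM and XPath classes. This is a simplified illustration rather than my actual program: XmlColumn is a made-up name, it reads values with getTextContent instead of trimming tags with substrings, and it assumes an XML sample without namespaces (real camt.054 files declare one, so the XPath handling would need a namespace context).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmlColumn {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Invalid XML", e);
        }
    }

    private static NodeList select(Document doc, String xpath) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    /** Reads the text of every node matched by the XPath: one table column. */
    static List<String> read(Document doc, String xpath) {
        NodeList nodes = select(doc, xpath);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    /** Writes edited values back to the same nodes, in document order. */
    static void write(Document doc, String xpath, List<String> values) {
        NodeList nodes = select(doc, xpath);
        for (int i = 0; i < nodes.getLength() && i < values.size(); i++) {
            nodes.item(i).setTextContent(values.get(i));
        }
    }

    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }
}
```

Each list returned by read can back one JTable column; on save, the cell values go through write and the result of serialize is written to the new file.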
We have a requirement to read a Word document and update it with dynamic data from our application, while other sections are edited by the user directly in Word.
Every time the user wants to fetch data from the application, he will upload the document and merge it.
To be more specific, I looked into manipulating the Word document by adding meta tags as markers in the OOXML for each section, but I am not able to find any. Is there an option to add meta tags for content, which can act like template markers?
Note: We want to implement this in a Java application.
You can use a Java library, Apache POI (http://poi.apache.org/), to manipulate DOC files, but for templating you need a separate system such as Velocity or FreeMarker. Or you can just use String replace.
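The plain string replacement option could look something like this. It is only a sketch over text that has already been extracted from the document: the ${key} marker syntax and the Placeholders class are assumptions, and no Apache POI calls are shown.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Placeholders {
    // Assumed marker syntax: ${key}, where key is letters, digits or underscores.
    private static final Pattern KEY = Pattern.compile("\\$\\{([A-Za-z0-9_]+)\\}");

    /** Replaces each known ${key} with its value; unknown keys are left as they are. */
    static String fill(String template, Map<String, String> values) {
        Matcher m = KEY.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Leaving unknown markers untouched makes missing data easy to spot in the output, and lets the same document be merged again later.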
I'm pretty sure the answer I'm going to get is: "why don't you just have the text files all be the same or follow some set format". Unfortunately I do not have this option, but I was wondering if there is a way to take any text file and translate it into another text or XML file that will always look the same?
The text files pretty much have the same data just arranged differently.
The closest I can come up with is to have an XSLT sheet for each text file, but then I have to turn around and read the file that was just created, delete it, and repeat for each text file.
So, is there a way to grab the data from text files that essentially have the same data just stored differently, and store this data in an object that I could then re-use later on in some process?
If it was up to me, I would push for every text file to follow some predefined format since they all pretty much contain the same data, but it's not up to me.
Odd question... You say they are text files yet mention XSLT as a possible solution. XSLT will only work if the source is XML, if that is so, please redefine the question. If you say text files I assume delimiter separated (e.g. csv), fixed length,...
There are some parsers (like Smooks) out there that allow you to parse multiple formats, but they will still require you to perform the "mapping" yourself of course.
This is a typical problem in the integration world, so any integration tool should offer you a solution (e.g. WSO2, Fuse,...).
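Doing the mapping by hand usually comes down to one small parser per source format, all producing the same object. A minimal sketch in plain Java follows; it does not use Smooks or any integration tool, the two line formats and the Contact fields are invented for illustration, and there is no input validation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class LineMappers {

    /** The common object every format is mapped to (hypothetical fields). */
    static final class Contact {
        final String name;
        final String email;

        Contact(String name, String email) {
            this.name = name;
            this.email = email;
        }
    }

    /** Hypothetical format A: "name,email". */
    static final Function<String, Contact> CSV = line -> {
        String[] fields = line.split(",", -1);
        return new Contact(fields[0].trim(), fields[1].trim());
    };

    /** Hypothetical format B: "key=value" pairs separated by ';', in any order. */
    static final Function<String, Contact> KEY_VALUE = line -> {
        Map<String, String> pairs = new HashMap<>();
        for (String pair : line.split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                pairs.put(kv[0].trim(), kv[1].trim());
            }
        }
        return new Contact(pairs.get("name"), pairs.get("email"));
    };
}
```

The rest of the process then only sees Contact objects, so supporting another file layout means adding one more mapper rather than another intermediate file.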
I am not able to update value of a custom property in a word document using java.
I have a word document which contains a custom property with value 'stack'. Using java I want to change that value to 'overflow'. I used two approaches.
1) Using Apache POI
I set the org.apache.poi.hpsf.CustomProperties in org.apache.poi.hpsf.DocumentSummaryInformation and wrote it to the POIFSFileSystem.
It does update the value in Word's custom properties table, but it doesn't update the value shown for that property in the document body. After the document is opened, I need to manually refresh the document to get that value updated.
2) Parse the document char by char and use the field codes DC3, DC4 and NAK to identify the location of custom properties. Replace the existing value with the new value.
Now the generated document contains the new values. But the problem here is, if the length of the old value and new value is different the document gets corrupted. I made sure the logic is good enough.
Any help would be appreciated.
To update the approach I took to solve the issue:
Using an Office 2007 (docx) document as a template turned out to be easier. It is nothing but a zip archive. You can open it using your zip application (WinZip/7-Zip) and you will find many XML files inside it. document.xml contains the content, styles.xml contains formatting information, and so on.
At runtime, I unzipped the document, parsed document.xml, and then used DOM to update it with dynamic content. Custom properties are available in a separate XML file.
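The unzip, edit and re-zip cycle can be sketched with java.util.zip. This is a simplified illustration rather than my exact code: DocxPartRewriter is a made-up name, the part is treated as UTF-8 text and handed to a caller-supplied edit function (a DOM-based edit can be plugged in there), and the zipOf and readPart helpers exist only to make the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxPartRewriter {

    /** Copies every zip entry unchanged except partName, whose text goes through edit. */
    static byte[] rewritePart(byte[] docx, String partName, UnaryOperator<String> edit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(docx));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                byte[] data = zin.readAllBytes(); // reads the current entry only
                if (entry.getName().equals(partName)) {
                    // Assumption: the part is UTF-8 encoded XML.
                    String edited = edit.apply(new String(data, StandardCharsets.UTF_8));
                    data = edited.getBytes(StandardCharsets.UTF_8);
                }
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(data);
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Returns the text of one part, or null if the archive has no such entry. */
    static String readPart(byte[] zip, String partName) {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals(partName)) {
                    return new String(zin.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a small in-memory zip from part name to content, for trying things out. */
    static byte[] zipOf(Map<String, String> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (Map.Entry<String, String> part : parts.entrySet()) {
                zout.putNextEntry(new ZipEntry(part.getKey()));
                zout.write(part.getValue().getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In a real docx the main body is the part word/document.xml and custom properties live in docProps/custom.xml, so changing 'stack' to 'overflow' everywhere means calling rewritePart once for each of those parts.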
Word 2003 users have to prepare the template in the Word application, save the document as XML, and then provide it as input.
No use of apache-poi now.