docx Template Docx4j replacing text in Java

I'm new to docx4j and my task is to replace some text in a docx template.
I read the docx4j Getting Started guide, but I don't think I fully understood the whole concept.
Anyway, I already tried [the unmarshalling template sample of docx4j][1],
which worked fine with the supplied docx, but I ran into problems when I tried it on my own template.
The exceptions say that the HashMap doesn't contain valid keys or values, and so the placeholders are not replaced.
I got rid of the
<w:proofErr w:type="spellEnd"/>
elements by disabling spell checking, but it still didn't work. The app also takes quite some time to run.
I didn't understand the data-bound example in Getting_Started.pdf, so I'm running out of options.
How can I simply replace some strings in a docx?
EDIT:
I found out that if I add some text to unmarshallFromTemplate.docx and save it, the placeholders in the new lines of text are not replaced.
The <w:t> tags are somehow split across multiple runs:
<w:p w:rsidR="002512F8" w:rsidRDefault="002512F8" w:rsidP="002512F8"><w:r><w:t>My</w:t></w:r><w:r w:rsidR="001A5174"><w:t xml:space="preserve"> favourite ice cream is ${DEGREE</w:t></w:r><w:r><w:t>}.</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p>
Editing the text in document.xml and adding the missing information didn't help much.
Anyway, here is the document.xml of the Template.docx that I'm using:
http://uploaded.net/file/vz4qr23o
EDIT 2:
I found a suitable workaround and don't know why it took me so long to figure it out.
As I said, the runs were split up and, in my opinion, the ${} was the reason. So I simply used a # before my placeholders and rewrote every placeholder so that it sits in a single run.
I had to switch to document.xml a couple of times and rewrite the passages, but then it worked. I then simply used a replace(placeholder, xml) on the text of the marshalled document.xml and unmarshalled it again.
It worked. End of story; no nightly build or mappings needed. Thanks.
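The workaround from EDIT 2 can be sketched as a plain string replacement on the marshalled XML. This is only a sketch under assumptions: PlaceholderReplacer and its methods are hypothetical names (not docx4j API), the XML string stands in for document.xml, and the marshalling and unmarshalling steps are left out. It assumes every #PLACEHOLDER sits inside a single run, and it XML-escapes the inserted values so they cannot break the markup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of docx4j.
class PlaceholderReplacer {

    // Escape the characters that would otherwise break the surrounding XML.
    static String escapeXml(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    // Replace every "#KEY" in the XML text with its escaped value.
    static String replaceAll(String xml, Map<String, String> values) {
        String result = xml;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("#" + e.getKey(), escapeXml(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the marshalled document.xml.
        String xml = "<w:p><w:r><w:t>My favourite ice cream is #FLAVOUR.</w:t></w:r></w:p>";
        Map<String, String> values = new LinkedHashMap<>();
        values.put("FLAVOUR", "vanilla & caramel");
        System.out.println(replaceAll(xml, values));
    }
}
```

One caveat of this approach: a key that is a prefix of another key (for example NAME and NAME2) would collide, so placeholder names should be chosen so that none starts with another.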

docx4j source code has been on GitHub for a while now; that svn repository is obsolete.
The equivalent sample is now called VariableReplace. That code is a bit more efficient, but you need to build it yourself, or use a current nightly build.
You'll probably find running VariablePrepare addresses your issue.

The placeholder search-and-replace code built into docx4j works just fine, but if your placeholders are being broken up by rsid entities, you need to make sure that grammar and spell checking are disabled when you save your "template" (i.e. source) document. This helps prevent your text runs from becoming fragmented (note that you might want to disable proofreading too, as that inserts bookmark tags here, there and everywhere).
Once you've done the search and replace and have a new or updated document, you can re-enable spell checking easily enough. This thread has more on RSIDs: "turnoff rsid's spell check & grammar check in generated xml"

Related

JTidy reports "3 errors were found!"... but does not say what they are

I have a large block of programmatically generated HTML. I ran it through Tidy (version r938) with the following Java code:
StringReader inStr = new StringReader(htmlInput);
StringWriter outStr = new StringWriter();
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parseDOM(inStr, outStr);
I get the following output:
InputStream: Document content looks like HTML 4.01 Transitional
247 warnings, 3 errors were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
Trouble is, Tidy doesn't tell me what 3 errors it found.
I'm fibbing here a little. The output above actually follows a long list of all 247 warnings (mostly trimming out empty div elements). I can suppress those with tidy.setShowWarnings(false); either way, I see no error report, so I can't figure out what I need to fix. 300Kb of HTML is too much for me to eyeball.
I've tried numerous approaches to finding the error. I can't run it through validate.w3.org, sadly, as the HTML file is on a proprietary network. The most informative approach was to open it in IntelliJ IDEA; this revealed a dozen or so duplicate div IDs, which I fixed. Errors still occurred.
I've looked around for other mentions of this problem. While I find plenty of hits on things like "How can I get the error/warning messages out of the parsed HTML using JTidy?", they all appear to be asking for dissimilar things, or assume conditions that simply aren't holding for me. I'm getting warnings just fine, for example; it's the errors I need, and they're not being reported, even if I call setShowErrors(100) or something.
Am I going to have to dive into Tidy's source code and debug it, starting where it reports errors? Or is there something much simpler I could do?

Here's what I ended up doing to track down the errors:
1. Download JTidy's source. Most people should be able to go straight to the source.
2. Unzip the source into my dev area, right on top of my existing source code. This also meant removing the Maven entry for JTidy from my pom.xml. (It also meant beating IntelliJ into submission, i.e. editing the relevant .iml files and restarting IJ a lot, when it got extremely confused by this.)
3. Set a breakpoint in Report.error. The first line of org.w3c.tidy.Report.error() increments lexer.errors; error() is called from many places in the lexer.
4. Run my program in debug mode. Expect this to take a little while if the input HTML is large; a 300k file took around 10-15 seconds on my machine to stop on an error that turned out to be at the very end of the file.
5. Look at the contents of lexbuf. lexbuf is a byte array, so your IDE might not show it as text. It might also be large. You probably want to look at the index the lexer had reached within lexbuf. If you have to, take that section of the byte array and cross-reference it with an ASCII table to get the text.
6. Search for that text in your HTML. Assuming it appears only once, there's your error. (In my case, it appeared exactly three times, and sure enough, I had three errors reported.)
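Instead of cross-referencing the bytes with an ASCII table by hand, the relevant slice of lexbuf can be decoded in the debugger's expression evaluator. The helper below is a hypothetical sketch, not part of JTidy; it only assumes that the buffer is a byte array and that you know roughly which index the lexer had reached.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical debugging helper; not part of JTidy.
class LexbufPeek {

    // Decode up to `radius` bytes on each side of `pos` as text.
    static String around(byte[] buf, int pos, int radius) {
        int p = Math.max(0, Math.min(pos, buf.length)); // clamp to the buffer
        int start = Math.max(0, p - radius);
        int end = Math.min(buf.length, p + radius);
        // ISO-8859-1 maps every byte to exactly one char, so decoding never fails.
        return new String(buf, start, end - start, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] buf = "<script>var s = \"<b>x</b>\";</script>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(around(buf, 18, 6));
    }
}
```

The decoded snippet can then be used as the search string in the HTML file.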
This was much more involved than it probably should have been. I suspect Report.error() was being called inappropriately.
In my case, error() was called with the constant BAD_CDATA_CONTENT. This constant is used only by Report.warning(). error() doesn't know what to do with it, and just exits silently with no message at all. If I change the call in Lexer.getCDATA() from error() to warning(), I get the exact line and column of my error. (I also get what appears to be reasonably well-formed XHTML, instead of an empty document.)
I'd submit a ticket to the JTidy project with some suggestions, but SourceForge isn't letting me log in for some reason. So, here:
Given that this "error" appears not to doom the document to unparseability, I'll tentatively suggest that that call be made a warning instead. (In my specific case, it was an HTML tag inside a string constant or comment inside a script element; shouldn't have hurt anything. I asked another question about it, just in case.)
Report.error() should have a default case that reports an unhandled error code if it gets one.
Hope this helps anyone else having what I'm guessing is a rather esoteric problem.

Is there a clean way to transform text files that are not the same into a standard format

I'm pretty sure the answer I'm going to get is: "why don't you just have the text files all be the same or follow some set format". Unfortunately I do not have this option, but I was wondering if there is a way to take any text file and translate it into another text or XML file that will always look the same?
The text files pretty much have the same data, just arranged differently.
The closest I can come up with is to have an XSLT sheet for each text file, but then I have to turn around and read the file that was just created, delete it, and repeat for each text file.
So, is there a way to grab the data from text files that essentially have the same data, just stored differently, and store this data in an object that I could then re-use later on in some process?
If it were up to me, I would push for every text file to follow some predefined format, since they all pretty much contain the same data, but it's not up to me.

Odd question... You say they are text files, yet mention XSLT as a possible solution. XSLT will only work if the source is XML; if that is so, please redefine the question. If you say text files, I assume delimiter-separated (e.g. CSV), fixed-length, and so on.
There are some parsers (like smooks) out there that allow you to parse multiple formats, but it will still require you to perform the "mapping" yourself of course.
This is a typical problem in the integration world so any integration tool should offer you a solution (e.g. wso2, fuse,...).
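Without any integration tool, the "mapping" step can be sketched in plain Java as one small parser per source layout, all producing the same record shape. Everything here is an assumption for illustration: the two layouts (comma-separated and key=value pairs), the field names, and the class names are hypothetical, since the question does not show the actual files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// One parser per source layout; all of them produce the same record shape.
interface LineParser {
    Map<String, String> parse(String line);
}

// Hypothetical layout 1: positional, comma-separated values.
class CsvLineParser implements LineParser {
    private final String[] columns;

    CsvLineParser(String... columns) {
        this.columns = columns;
    }

    public Map<String, String> parse(String line) {
        String[] parts = line.split(",", -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < columns.length && i < parts.length; i++) {
            record.put(columns[i], parts[i].trim());
        }
        return record;
    }
}

// Hypothetical layout 2: "key=value" pairs separated by semicolons.
class KeyValueLineParser implements LineParser {
    public Map<String, String> parse(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        for (String pair : line.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                record.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return record;
    }
}

class NormalizeDemo {
    public static void main(String[] args) {
        LineParser csv = new CsvLineParser("name", "city");
        LineParser kv = new KeyValueLineParser();
        // Both lines end up as a record with the keys "name" and "city".
        System.out.println(csv.parse("Smith, Springfield"));
        System.out.println(kv.parse("city=Springfield; name=Smith"));
    }
}
```

The rest of the process then works on the common record only, so no intermediate file has to be written, read back and deleted.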

Searching for words like "UTTD_Equip_City_TE" in Lucene

Thanks for reading :)
I'm trying to search for words like "UTTD_Equip_City_TE" across RTF documents using Lucene. This word appears in two different forms:
«UTTD_Equip_City_TE»,
«UTTD_Equip_City_TE»
I first tried with StandardAnalyzer, but it seems to break down the word into "UTTD", "Equip", "City", and "TE".
Then I tried again using WhitespaceAnalyzer, but it doesn't seem to be working either (I don't know why).
Could you help me figure out how I should approach this problem? By the way, editing the Lucene source and recompiling it with Ant is not an option :(
Thanks.
EDIT: there are other texts in this document, too. For example:
SHIP TO LESSEE (EQUIPMENT location address): «UTTD_Equip_StreetAddress_TE», «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»
Basically, I'm trying to index RTF files, and inside each RTF file are tables with variables. Variables are wrapped in « and ». I'm trying to search for those variables in the documents. I've tried searching for "«" + string + "»", but it hasn't worked...
This example could give a better picture: http://i.imgur.com/SwlO1.png
Please help.

KeywordAnalyzer tokenizes the entire field as a single string. It sounds like this might be what you're looking for, if the substrings are in different fields within your document.
See: KeywordAnalyzer
Instead, if you are adding the entire content of the document within a single field, and you want to search for a substring with embedded '_' characters within it, then I would think that WhitespaceAnalyzer would work. You stated that it didn't work, though. Can you tell us what the results were when you tried using WhitespaceAnalyzer? And did you use it for both Indexing and Querying?

I see two options here. In both cases you have to build a custom analyzer.
Option 1
Start with StandardTokenizer's grammar file and customize it so that it emits text separated by '_' as a single token. (refer to Generating a custom Tokenizer for new TokenStream API using JFlex/ Java CC). Build your Analyzer using this new Tokenizer along with LowerCaseFilter.
Option 2
Write a custom Analyzer that is made of WhitespaceTokenizer and custom TokenFilters. In these TokenFilters you decide how to act on the tokens returned by WhitespaceTokenizer.
Refer to http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/analysis/package-summary.html for more details on analysis
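To see why splitting on whitespace alone may not be enough, and what the filter in Option 2 would have to do, here is a plain-Java imitation of "whitespace tokenizer plus trimming filter". It deliberately uses no Lucene classes; VariableTokens and its methods are hypothetical names, and a real implementation would put the same logic into a TokenFilter. The guillemets are written as \u00AB and \u00BB escapes.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of the analysis chain; no Lucene classes are used.
class VariableTokens {

    // Split on whitespace only, as a whitespace tokenizer would.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter step: strip leading and trailing characters that are not
    // letters, digits or '_' (guillemets, commas, colons, parentheses).
    static String trimPunctuation(String token) {
        return token.replaceAll("^[^\\p{L}\\p{N}_]+|[^\\p{L}\\p{N}_]+$", "");
    }

    public static void main(String[] args) {
        String text = "\u00ABUTTD_Equip_City_TE\u00BB, \u00ABUTTD_Equip_State_MC\u00BB";
        for (String raw : whitespaceTokens(text)) {
            // The raw token still carries the guillemets and the comma,
            // so a query for the bare variable name would not match it.
            System.out.println(raw + " -> " + trimPunctuation(raw));
        }
    }
}
```

Whatever normalization is chosen, it has to be applied both when indexing and when querying, otherwise the query term and the indexed term will differ.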

How do you Edit MP4 ID3 Tags in Java?

I asked a similar question some time ago, but with python, and have since then decided to switch to Java because there seemed to be more resources to do this sort of thing. Basically I need some sort of library, idea, or instructions that would allow me to edit ID3 tags in an MP4 file like the kind found in iTunes. If anyone knows anything, your help would be greatly appreciated.
So far I've done the following:
I've found this question/answer to a very similar problem: How do you Edit Video ID3v2 Tags in Java (it describes how to use a library intended for audio files called JID3 to edit video ID3 tags), but I can't figure out for the life of me how to actually import it into an eclipse project and use it. I basically unpacked it and added all the packages into the project, but the one time it worked it made the movie file unreadable to any media player afterwards. If anyone has specific knowledge of how to import and use JID3 that would be great.
I've found this site: http://willcode4beer.com/parsing.jsp?set=mp3ID3 which has some seemingly good code for reading and writing ID3 tags. Unfortunately it does not work properly: it constantly returns strings of question marks or spontaneously tells me that the file is not there (it will literally work one time and then not work another time without any changes). Nevertheless, I like the idea of simply reading the bytes or ASCII of a file and finding/editing the ID3 tag that way, so if anyone knows how to do that, that'd be awesome.
Thanks in advance.

The metadata in MP4 is not necessarily in ID3 format. There is the possibility to use ID3 but it is not widely used. The ID3 bytes are then in /moov/meta/id32 box.
The iTunes files bear their meta information in /moov/udta/... There are multiple boxes like '©cmt', '©nam', '©des', '©cpy' that each contain a string for (in this case) comment, name, description and copyright. Have a look at http://code.google.com/p/mp4parser/ to visualize, parse and write MP4 files.

If I understand you correctly you want to be able to edit such metadata as: artist, track, cover image etc. and then be able to see your changes in iTunes or QuickTime.
In that case you may want to look at the new API available in JCodec (org.jcodec.movtool.MetadataEditor).
It also has a CLI (org.jcodec.movtool.MetadataEditorMain).
Here's the basic usage:
# Changes the author of the movie
./metaedit -f -si ©ART=New\ value file.mov
or the same thing via the Java API:
MetadataEditor mediaMeta = MetadataEditor.createFrom(new File("file.mp4"));
Map<Integer, MetaValue> meta = mediaMeta.getItunesMeta();
meta.put(0xa9415254, MetaValue.createString("New value")); // fourcc for '©ART'
mediaMeta.save(false); // fast mode is off
You can find a complete documentation here: http://jcodec.org/docs/working_with_mp4_metadata.html
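The integer key 0xa9415254 used above is just the four bytes of the atom name '©ART' packed big-endian, with © being the single byte 0xA9 in ISO-8859-1. A small helper (hypothetical, not part of JCodec) makes keys for other atoms easier to derive:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of JCodec.
class FourCc {

    // Pack a four-character atom name into a big-endian int.
    static int fourcc(String name) {
        byte[] b = name.getBytes(StandardCharsets.ISO_8859_1);
        if (b.length != 4) {
            throw new IllegalArgumentException("need exactly 4 characters: " + name);
        }
        return ((b[0] & 0xFF) << 24)
             | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8)
             |  (b[3] & 0xFF);
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright sign.
        System.out.println(Integer.toHexString(fourcc("\u00A9ART"))); // prints a9415254
        System.out.println(Integer.toHexString(fourcc("\u00A9nam"))); // prints a96e616d
    }
}
```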

Search for commented-out code across files in Eclipse

Is there a quick way to find all the commented-out code across Java files in Eclipse?
Any option in Search, perhaps, or any add-on that can do this?
It should be able to find only code which is commented out, but not ordinary comments.

In Eclipse, I just do a file search with the regular expression checkbox turned on:
(/\*.*;.*\*/)|(//.*;)
It will find semicolons in
// These;
and /* these; */
Works for me.
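The same expression can be tried outside Eclipse to see what it does and does not catch. This is a small sketch with a hypothetical class name; note that it looks at one line at a time, so a block comment spanning several lines is not matched, and a "//" inside a string literal followed by a semicolon gives a false positive.

```java
import java.util.regex.Pattern;

class CommentedCodeRegex {

    // Same expression as in the Eclipse search, escaped for a Java string.
    static final Pattern COMMENTED_CODE =
            Pattern.compile("(/\\*.*;.*\\*/)|(//.*;)");

    static boolean looksLikeCommentedCode(String line) {
        return COMMENTED_CODE.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "// int x = compute();",      // commented-out statement
            "/* list.clear(); */",        // commented-out statement
            "// explains the next block", // ordinary comment
            "int y = 2; // counter"       // code with a trailing comment
        };
        for (String line : lines) {
            System.out.println(looksLikeCommentedCode(line) + "  " + line);
        }
    }
}
```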

Sonar can do it: http://www.sonarsource.org/commented-out-code-eradication-with-sonar/

You can mark your own commented code with a task tag. You can create your own task tags in Eclipse.
From the menu, go to Window -> Preferences. In the Preferences dialog, go to General -> Editors -> Structured Text Editors -> Task Tags.
Add an appropriate task tag, like COMMENTED. Set the priority to Low.
Then, any code you comment out, you can mark with the COMMENTED task tag. A list of these task tags, along with their locations, appears in the Tasks view.

@Jorn said:
I think [the OP] wants to find code that is commented out, not code that has a comment.
If the intention is to find commented out code, then I don't think it is possible in general. The problem is that it is impossible to distinguish between comments that were written as code or pseudo-code, and code that is commented out. Making that distinction requires human intelligence.
Now IDEs typically have a "toggle comments" function that comments out code in a particular way. It would be feasible to write a tool / plugin that matches the style produced by a particular IDE. But that's probably not good enough, especially since reformatting the code typically gets rid of the characteristics that made the commented out code recognizable.

If the problem is to find commented-out code, what is needed is a way to find comments, and a way to decide if a comment might contain code.
A simple way to do this is to search for comments that contain code-like things. I'd be tempted to hunt for comments containing a ";" character (or some other rare indicator such as "="); it will be pretty hard to have any interesting commented code that doesn't contain this, and in my experience people don't write many ordinary comments that contain it. A regexp search for this should be pretty straightforward, even if it picks up a few additional false positives (e.g. // in a string literal).
A more sophisticated way to accomplish this is to use a Java lexer or parser. If you have a lexer that returns comments as tokens (not all of them do; Java compilers aren't interested in comments), then you can simply scan the lexemes for a comment and do the semicolon check I described above. You won't get any false-positive hits for comment-like things in string literals with this approach.
If you have a re-engineering parser that captures comments as part of the AST (such as our SD Java Front End), you can mechanically scan the parse tree for comments, feed the comment content back to the parser to see if the content is code-like, and report any that pass that test modulo some size-dependent error rate (10 errors in 15 characters implies "really is a comment"). Now the "code-like" test requires the reengineering parser to be willing to recognize any substring of the (Java) language.
Our DMS Software Reengineering Toolkit underlying the Java Front End can actually do that, using access to the grammar buried in the front end, as it is willing to start a parse for any language (non)terminal, and this question is "can you find a sequence of (non)terminals that consumes the string?".
The lexer and parser approaches are small and big sledgehammers respectively. If OP is going to do this just once, he can stick to the manual regex search. If the problem is to vet the code base repeatedly (needed in big organizations), he'd want a tool that can be run on a regular basis.
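The "small sledgehammer" can be approximated without a full lexer. The sketch below is a hand-written scanner (hypothetical, not one of the tools named above) that skips string and character literals, collects // and /* */ comments, and reports those containing a ';'. It does not handle Java text blocks ("""...""") or unicode escapes, so treat it as an illustration of the idea only.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written approximation of "lex, then apply the semicolon check".
class CommentScanner {

    // Return the text of every comment in `src` that contains a ';'.
    static List<String> suspiciousComments(String src) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                // Skip a string or char literal, honouring backslash escapes.
                char quote = c;
                i++;
                while (i < n && src.charAt(i) != quote) {
                    if (src.charAt(i) == '\\') {
                        i++; // skip the escaped character
                    }
                    i++;
                }
                i++; // step over the closing quote
            } else if (src.startsWith("//", i)) {
                int end = src.indexOf('\n', i);
                if (end < 0) {
                    end = n;
                }
                addIfCodeLike(src.substring(i, end), hits);
                i = end;
            } else if (src.startsWith("/*", i)) {
                int close = src.indexOf("*/", i + 2);
                int end = (close < 0) ? n : close + 2;
                addIfCodeLike(src.substring(i, end), hits);
                i = end;
            } else {
                i++;
            }
        }
        return hits;
    }

    // The "code-like" test from the answer: the comment contains a semicolon.
    private static void addIfCodeLike(String comment, List<String> hits) {
        if (comment.indexOf(';') >= 0) {
            hits.add(comment);
        }
    }

    public static void main(String[] args) {
        String src = "String u = \"http://x;y\"; // plain note\n"
                   + "// foo();\n"
                   + "/* a = b; */ int z;";
        // The URL inside the string literal is not reported.
        for (String hit : suspiciousComments(src)) {
            System.out.println(hit);
        }
    }
}
```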

You can do a search in Eclipse.
All you need to search for is /* and //
However, you will only find the files which contain that expression, and not the actual content which I believe you are after.
However, if you are using Linux you can easily get all the comments with a one liner.
