Merging two .odt files from code - java

How do you merge two .odt files? Doing that by hand, opening each file and copying the content would work, but is unfeasable.
I have tried odttoolkit Simple API (simple-odf-0.8.1-incubating) to achieve that task, creating an empty TextDocument and merging everything into it:
private File masterFile = new File(...);
...
TextDocument t = TextDocument.newTextDocument();
t.save(masterFile);
...
for(File f : filesToMerge){
joinOdt(f);
}
...
void joinOdt(File joinee){
TextDocument master = (TextDocument) TextDocument.loadDocument(masterFile);
TextDocument slave = (TextDocument) TextDocument.loadDocument(joinee);
master.insertContentFromDocumentAfter(slave, master.getParagraphByReverseIndex(0, false), true);
master.save(masterFile);
}
And that works reasonably well, however it looses information about fonts - original files are a combination of Arial Narrow and Windings (for check boxes), output masterFile is all in TimesNewRoman. At first I suspected last parameter of insertContentFromDocumentAfter, but changing it to false breaks (almost) all formatting. Am I doing something wrong? Is there any other way?

I think this is "works as designed".
I tried this once with a global document, which imports documents and display them as is... as long as paragraph styles have different names !
Using same named templates are overwritten with the values the "master" document have.
So I ended up cloning standard styles with unique (per document) names.
HTH

Ma case was a rather simple one, files I wanted to merge were generated the same way and used the same basic formatting. Therefore, starting off of one of my files, instead of an empty document fixed my problem.
However this question will remain open until someone comes up with a more general solution to formatting retention (possibly based on ngulams answer and comments?).

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. Color, Font, size of the text, Hyperlinks..etc. ).
I want to keep format of the final docx same as the original docx after converting words from Unicode to another font.
PFA.
Here is my Code
try {
fileInputStream = new FileInputStream("StartDoc.docx");
document = new XWPFDocument(fileInputStream);
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
List<XWPFParagraph> paragraph = document.getParagraphs();
Converter data = new Converter() ;
for(XWPFParagraph p :document.getParagraphs())
{
for(XWPFRun r :p.getRuns())
{
String string2 = r.getText(0);
data.uniToShree(string2);
r.setText(string2,0);
}
}
//Write the Document in file system
FileOutputStream out = new FileOutputStream(new File("Output.docx");
document.write(out);
out.close();
System.out.println("Output.docx written successully");
}
catch (IOException e) {
System.out.println("We had an error while reading the Word Doc");
}
Thank you for ask-an-answer.
I have worked using POI some years ago, but over excel-workbooks, but still I’ll try to help you reach the root cause of your error.
The Java compiler is smart enough to suggest good debugging information in itself!
A good first step to disambiguate the error is to not overwrite the exception message provided to you via the compiler complain.
Try printing the results of e.getLocalizedMessage()or e.getMessage() and see what you get.
Getting the stack trace using printStackTrace method is also useful oftentimes to pinpoint where your error lies!
Share your findings from the above method calls to further help you help debug the issue.
[EDIT 1:]
So it seems, you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted data file.
(thus, "We had an error while reading the Word Doc", is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data as you are working only on the content of your respective doc files.
In order to be able to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word which defined the doc files and their extension (.docx) follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML Namespace packages[1].
You can obtain the XML(HTML) format of the doc-file you want quite easily (see steps in [1] or code in link [2]) and even apply different schemas or possibly your own schema definitions based on the definitions provided by MS's namespaces, either programmatically, for which you need to get versed with XML, XSL and XSLT concepts (w3schools[3] is a good starting point) but this method is no less complex than writing your own version of MS-Word; or using MS-Word's inbuilt tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer provides you with a cursory overview of how to achieve what you want to, but depending on your inclination and time availability, you may want to use your discretion before you decide to head onto one path than the other.
Hope it helps!

File wildcard use *

I am trying to read a file which has name: K2ssal.timestamp.
I want to handle the time stamp part of the file name as wildcard.
How can I achieve this ?
tried * after file name but not working.
var getK2SSal: Iterator[String] = Source.fromFile("C://Users/nrakhad/Desktop/Work/Data stage migration/Input files/K2Ssal.*").getLines()
You can use Files.newDirectoryStream with directory + glob:
import java.nio.file.{Paths, Files}
val yourFile = Files.newDirectoryStream(
Paths.get("/path/to/the/directory"), // where is the file?
"K2Ssal.*" // glob of the file name
).iterator.next // get first match
Misconception on your end: unless the library call is specifically implemented to do so, using a wildcard simply doesn't work like you expect it to.
Meaning: a file system doesn't know about wildcards. It only knows about existing files and folders. The fact that you can put * on certain commands, and that the wildcard is replaced with file names is a property of the tool(s) you are using. And most often, programming APIs that allow you to query the file system do not include that special wild card handling.
In other words: there is no sense in adding that asterisk like that.
You have to step back and write code that actively searches for files itself. Here are some examples for scala.
You can read the directory and filter on files based upon the string.
val l = new File("""C://Users/nrakhad/Desktop/Work/Data stage migration/Input files/""").listFiles
val s = l.filter(_.toString.contains("K2Ssal."))

Apache Commons Configuration : read a .properties file and rewrite it with no change

Since I do not know of a better solution, I am currently writing small Java classes to process .properties file to merge them, remove duplicate properties, override properties, etc. (I need to process many files and a huge number of properties).
org.apache.commons.configuration.PropertiesConfiguration works great for reading a properties file (using org.apache.commons.configuration.AbstractFileConfiguration.load(InputStream, String), however if I rewrite the file using org.apache.commons.configuration.AbstractFileConfiguration.save(File), I have two problems:
the original layout and comments are lost. I am going to try the PropertiesConfigurationLayout, which is supposed to help here (see How to overwrite one property in .properties without overwriting the whole file?) and post the results
the properties are slightly modified. Accents é and è are rewritten as unicode characters (\u00E9), which I do not want. Afaik .properties files are generally ISO-8859-1 (and I think mine are), so escaping shouldn't be necessary.
Specifying the encoding when calling org.apache.commons.configuration.AbstractFileConfiguration.load(InputStream, String) does not make a difference, because when it is not specified, the same encoding is used by default anyway (private static final String DEFAULT_ENCODING = "ISO-8859-1";). What could I do about that ?
Doing some tests I think you can do what you want, using CombinedConfiguration plus a OverrideCombiner. Basically the properties will be merged automatically and the trick for the layout is to get the layout from one of the loaded files:
CombinedConfiguration props = new CombinedConfiguration();
final PropertiesConfiguration defaultsProps = new PropertiesConfiguration(new File("/tmp/default.properties"));
final PropertiesConfiguration customProps = new PropertiesConfiguration(new File("/tmp/custom.properties"));
props.setNodeCombiner(new OverrideCombiner());
props.addConfiguration(customProps); //first should be loaded the override values
props.addConfiguration(defaultsProps); // last your 'default' values
PropertiesConfiguration finalFile = new PropertiesConfiguration();
finalFile.append(props);
PropertiesConfigurationLayout layout = new PropertiesConfigurationLayout(finalFile, defaultsProps.getLayout()); //here we copy the layout from the 'base file'
layout.save(new FileWriter(new File("/tmp/app.properties")));
The issue with the encoding I don't know if its possible to find a solution.

Commons configuration library to add elements

I am using the apache commons configuration library to read a configuration xml and it works nicely. However, I am not able to modify the value of the elements or add new ones.
To read the xml I use the following code:
XMLConfiguration config = new XMLConfiguration(dnsXmlPath);
boolean enabled = config.getBoolean("enabled", true));
int size = config.getInt("size");
To write I am trying to use:
config.setProperty("newProperty", "valueNewProperty");
config.save();
If I call config.getString("newProperty"), I obtain "valueNewProperty", but the xml has not been changed.
Obviously it is not the right way or I am missing something, because it does not work.
Could anybody tell me how to do this?
Thanks in advance.
You're modifying xml structure in memory
The parsed document will be stored keeping its structure. The class also tries to preserve as much information from the loaded XML document as possible, including comments and processing instructions. These will be contained in documents created by the save() methods, too.
Like other file based configuration classes this class maintains the name and path to the loaded configuration file. These properties can be altered using several setter methods, but they are not modified by save() and load() methods. If XML documents contain relative paths to other documents (e.g. to a DTD), these references are resolved based on the path set for this configuration.
You need to use XMLConfiguration.html#save(java.io.Writer) method
For example, after you've done all your modifications save it:
config.save(new PrintWriter(new File(dnsXmlPath)));
EDIT
As mentioned in comment, calling config.load() before calling setProperty() method fixes the issue.
I solved it with the following lines. I was missing the config.load().
XMLConfiguration config = new XMLConfiguration(dnsXmlPath);
config.load();
config.setProperty("newProperty", "valueNewProperty");
config.save();
It is true though that you can used the next line instead of config.save() and works the same.
config.save(new PrintWriter(new File(dnsXmlPath)));

How to extract data from a lot of URLs?

I have about 3200 URLs to small XML files which have some data in the form of strings(obviously).The XML files are displayed(not downloaded) when I go to the URLs. So I need to extract some data from all those XMLs and save it in a single .txt file or XML file or whatever. How can I automate this process?
*Note: This is what the files look like. I need to copy the 'location' and 'title' from all of them and put them in one single file. Using what methodology can this be achieved?
<?xml version="1.0"?>
-<playlist xmlns="http://xspf.org/ns/0/" version="1">
-<tracklist>
<location>http://radiotool.com/fransn.mp3</location>
<title>France, Paris radio 104.5</title>
</tracklist>
</playlist>
*edit: Fixed XML.
It's easy enough with XQuery or XSLT, though the details will depend on how the URLs are held. If they're in a Java List, then (with Saxon at least) you can supply this list as a parameter to the following query:
declare variable urls as xs:string* external;
<data>{
for $u in $urls return doc($u)//*:tracklist
}</data>
The Java code would be something like:
Processor proc = new Processor();
XQueryCompiler c = proc.newXQueryCompiler();
XQueryEvaluator q = c.compile($query).load();
List<XdmItem> urls = new ArrayList();
for (url : inputUrls) {
urls.append(new XdmAtomicValue(url);
}
q.setExternalVariable(new QName("urls"), new XdmValue(urls));
q.setDestination(...)
run();
Have a look at the JSoup library here: http://jsoup.org/
It has facilities for pulling and fixing the contents of a URL, it is intended for HTML though, so I'm not sure it will be good for XML, but it is worth a look.

Categories