With Java: replace string in MS Word file [closed]

We need a Java library to replace strings in MS Word files.
Can anyone suggest one?

While there is MS Word support in Apache POI, it is not very good. Loading and then saving any file with other than the most basic formatting will likely garble the layout. You should try it out though, maybe it works for you.
There are a number of commercial libraries as well, but I don't know if any of them are any better.
The crappy "solution" I had to settle for when working on a similar requirement recently was using the DOCX format: open the ZIP container, read the document XML, and replace my markers with the right text. This does work for replacing simple bits of text that don't span paragraph boundaries etc.
private static final String WORD_TEMPLATE_PATH = "word/word_template.docx";
private static final String DOCUMENT_XML = "word/document.xml";
/* ... */
// Copy the DOCX (a ZIP container) entry by entry, rewriting only document.xml
final Resource templateFile = new ClassPathResource(WORD_TEMPLATE_PATH);
final ZipInputStream zipIn = new ZipInputStream(templateFile.getInputStream());
final ZipOutputStream zipOut = new ZipOutputStream(output);
ZipEntry inEntry;
while ((inEntry = zipIn.getNextEntry()) != null) {
    final ZipEntry outEntry = new ZipEntry(inEntry.getName());
    zipOut.putNextEntry(outEntry);
    if (inEntry.getName().equals(DOCUMENT_XML)) {
        // The main document content: read it as text and replace the markers
        final String contentIn = IOUtils.toString(zipIn, UTF_8);
        final String outContent = this.processContent(new StringReader(contentIn));
        IOUtils.write(outContent, zipOut, UTF_8);
    } else {
        // Everything else (styles, images, ...) is copied through untouched
        IOUtils.copy(zipIn, zipOut);
    }
    zipOut.closeEntry();
}
zipIn.close();
zipOut.finish();
I'm not proud of it, but it works.

I would suggest the Apache POI library:
http://poi.apache.org/
Looking closer, it seems the Word support hasn't been kept up to date - boo! It may still be complete enough to do what you need, however.
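If you do go with POI, a minimal sketch of string replacement in a DOCX via the XWPF API could look like this (the file names and the ${name} marker are made up; note that Word may split a marker across several runs, and this naive per-run replacement would miss those):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Replace a marker in every run of every paragraph
try (XWPFDocument doc = new XWPFDocument(new FileInputStream("template.docx"))) {
    for (XWPFParagraph p : doc.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String text = r.getText(0);
            if (text != null && text.contains("${name}")) {
                // position 0 overwrites the run's existing text, keeping its formatting
                r.setText(text.replace("${name}", "John Doe"), 0);
            }
        }
    }
    try (FileOutputStream out = new FileOutputStream("result.docx")) {
        doc.write(out);
    }
}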

Try this one: http://www.dancrintea.ro/doc-to-pdf/
Besides replacing strings in MS Word files, it can also:
- read/write Excel files using a simplified API like getCell(x,y) and setCell(x,y,string)
- hide Excel sheets (secondary calculations, for example)
- replace images in DOC, ODT and SXW files
- and convert:
doc --> pdf, html, txt, rtf
xls --> pdf, html, csv
ppt --> pdf, swf

I would take a look at the Apache POI project. This is what I have used to interact with MS documents in the past.
http://poi.apache.org/

Thanks all. I am going to try http://www.dancrintea.ro/doc-to-pdf/
because I need to convert classic DOC files (binary), not DOCX (ZIP format).

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (i.e. color, font, text size, hyperlinks, etc.).
I want to keep the format of the final docx the same as the original docx after converting words from Unicode to the other font.
PFA.
Here is my code:
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            // convert the run's text and write it back into the same run
            string2 = data.uniToShree(string2);
            r.setText(string2, 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Thanks for the question.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The exception itself carries good debugging information!
A good first step to disambiguate the error is to not swallow the exception message that is handed to you.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace with the printStackTrace method is also often useful to pinpoint where the error lies!
Share your findings from the above method calls to help debug the issue further.
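Applied to the code in the question, the catch block could become something like this (a minimal sketch; the message wording is my own):
catch (IOException e) {
    // surface the real diagnostic instead of a fixed message
    System.out.println("Error while processing the Word doc: " + e.getMessage());
    e.printStackTrace();
}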
[EDIT 1:]
So it seems you are able to process the file just fine with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted file.
(Thus, "We had an error while reading the Word Doc" is a lie getting printed. ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your respective doc files.
To retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file quite easily (see the steps in [1] or the code in [2]), and even apply different schemas, or possibly your own schema definitions based on the ones provided by MS's namespaces. You can do this programmatically, for which you need to get versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), though that method is no less complex than writing your own version of MS Word; or you can use MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
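As a cursory illustration of the code in [2], a sketch along these lines converts a binary .doc file to HTML with POI's scratchpad WordToHtmlConverter (the file names are made up, and you should verify the exact API against your POI version):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;

// Build an HTML DOM from the Word file, then serialize it to disk
HWPFDocument doc = new HWPFDocument(new FileInputStream("input.doc"));
WordToHtmlConverter converter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
converter.processDocument(doc);
Transformer serializer = TransformerFactory.newInstance().newTransformer();
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(new DOMSource(converter.getDocument()),
        new StreamResult(new FileOutputStream("output.html")));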
My answer gives you a cursory overview of how to achieve what you want, but depending on your inclination and time availability, you may want to use your discretion before deciding which path to head down.
Hope it helps!

Get File Extension for special cases like tar.gz [closed]

I need to extract extensions from file names.
I know this can be done for single extensions like .gz or .tar by using filePath.lastIndexOf('.') or using utility methods like FilenameUtils.getExtension(filePath) from Apache commons-io.
But, what if I have a file with an extension like .tar.gz? How can I manage files with extensions that contain . characters?
If you know what extensions are important, you can simply check for them explicitly. You would have a collection of known extensions, like this:
List<String> EXTS = Arrays.asList("tar.gz", "tgz", "gz", "zip");
You could get the (first) longest matching extension like this:
String getExtension(String fileName) {
    String found = null;
    for (String ext : EXTS) {
        if (fileName.endsWith("." + ext)) {
            if (found == null || found.length() < ext.length()) {
                found = ext;
            }
        }
    }
    return found;
}
So calling getExtension("file.tar.gz") would return "tar.gz".
If you have mixed-case names, perhaps try changing the check to filename.toLowerCase().endsWith("." + ext) inside the loop.
A file can have just one extension!
If you have a file test.tar.gz,
.gz is the extension and
test.tar is the basename!
.tar in this case is part of the basename, not part of the extension!
If you would like a file encoded as tar and gz, you should call it .tgz. Using .tar.gz is bad practice; if you need to handle these files, you should use a workaround like renaming the file to test.tgz.
Found a simple way: use substring to get the file name only, and indexOf instead of lastIndexOf to get the first '.' and the extension after it.
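A sketch of that idea (getFullExtension is a made-up helper; note it treats everything after the first dot as the extension, so a dotted base name like my.notes.txt would yield notes.txt):
import java.io.File;

static String getFullExtension(String path) {
    String name = new File(path).getName();        // file name only, e.g. "archive.tar.gz"
    int dot = name.indexOf('.');                   // FIRST dot, not the last
    return dot < 0 ? "" : name.substring(dot + 1); // "tar.gz"
}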
You can get the filename part of the path, split on ., and take the final 0, 1, or 2 elements of the array as the extension.
Of course, if .tar.* (gz, bz2, etc.) is your only edge case, it may be pragmatic to just build a solution that checks filenames for .tar. and uses that as the point at which to extract the extension (including the .tar portion).

PDF file generation from XML or HTML [closed]

Is there any API/solution to generate a PDF report from XML data and definitions?
For example the XML definition/data could be:
<pdf>
<paragraph font="Arial">Title of report</paragraph>
</pdf>
Converting HTML to PDF would also be a good solution, I feel.
Currently we write Java code using the iText API. I want to externalize the code so that a non-technical person can edit it and make changes.
Have a look at Apache FOP. Use an XSLT stylesheet to convert the XML (or XHTML) into XSL-FO. Then use FOP to read the XSL-FO document and format it to a PDF document (see Hello World with FOP).
Apache FOP can use a lot of memory for large documents (e.g., a 200-page PDF), which may require tweaking the JVM memory settings.
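For reference, a minimal embedding sketch along the lines of FOP's XML-to-PDF example (the stylesheet and file names are made up; FopFactory.newInstance(URI) is the FOP 2.x signature, older versions use the no-argument newInstance()):
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;

FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());
try (OutputStream out = new FileOutputStream("report.pdf")) {
    Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
    // XSLT turns the XML data into XSL-FO, which is piped straight into FOP
    Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new File("report2fo.xsl")));
    transformer.transform(new StreamSource(new File("data.xml")),
            new SAXResult(fop.getDefaultHandler()));
}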
iText has a facility for generating PDFs from XML (and HTML, I think). Here is the DTD, but I found it difficult to sort out. Aside from that, I never found any good documentation on what is supported. My approach was to look at the source of SAXiTextHandler and ElementTags to figure out what was acceptable. Although not ideal, it is pretty straightforward.
<itext orientation="portrait" pagesize="LETTER" top="36" bottom="36" left="36" right="36" title="My Example" subject="My Subject" author="Me">
<paragraph size="8" >This is an example</paragraph>
</itext>
...
import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.PdfWriter;
import com.lowagie.text.xml.SAXiTextHandler;
...
String inXml = ""; // use the xml above as an example
ByteArrayOutputStream temp = new ByteArrayOutputStream();
Document document = new Document();
PdfWriter writer = null;
try
{
    writer = PdfWriter.getInstance(document, temp);
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    // ByteArrayInputStream takes bytes, not a String
    parser.parse(new ByteArrayInputStream(inXml.getBytes()), new SAXiTextHandler(document));
}
catch (Exception e)
{
    // instead, catch the proper exception and do something meaningful
    e.printStackTrace();
}
finally
{
    if (writer != null)
    {
        try
        {
            writer.close();
        }
        catch (Exception ignore)
        {
            // ignore
        }
    } // if
}
// temp holds the PDF
Take a look at JasperReports; it uses iText to export files, I think, and its IDE is simple and can be used by non-programmers.
Edit: I forgot to mention, you can use the JasperReports engine directly in your application, or you can use iReport, the "Designer for JasperReports".
You will want to use a well supported XML format for this, as it will allow you to leverage the work of others.
A well supported XML format is DocBook XML - http://www.docbook.org/ - and this - http://sagehill.net/docbookxsl/index.html - appears to be a good resource on doing the XML -> PDF using XSLT with the Docbook style sheets and other formats.
This approach allows you to use any XSLT processor and any XSL/FO processor to get to your result. This gives you easy scriptability as well as the freedom to switch implementations if needed - notably older Apache FOP implementations degraded badly when the resulting PDF got "too large".
Prince is one of the best tools out there. It uses CSS for styles, so if you appreciate that method of separating data from display (read: your users are able to do that too), it may be a very good fit for you. (The control over display that browsers offer through CSS is, by comparison, primitive.)

What is the best approach to implement search for searching documents (PDF, XML, HTML, MS Word)?

What would be a good way to code search functionality for documents in a Java web application?
Is 'tagged search' a good fit for this kind of search functionality?
Why re-invent the wheel?
Check out Apache Lucene.
Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's another one, for example:
How do I implement Search Functionality in a website?
You could use Solr, which sits on top of Lucene and is a real web search engine application, while Lucene is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. to extract text and metadata; you need to index the documents according to a pre-defined document schema.
As for extracting the text content of Office documents (which you need to do before giving it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.
Using Tika, the code to get the text from a file is quite simple:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
// exception handling not shown
Parser parser = new AutoDetectParser();
StringWriter textBuffer = new StringWriter();
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
md.set(Metadata.RESOURCE_NAME_KEY, file.getName()); // the file name helps type detection
parser.parse(input, new BodyContentHandler(textBuffer), md);
String text = textBuffer.toString();
So far, Tika 0.3 seems to work great. Just throw any file at it and it will give you back what makes the most sense for that format. I can get the text for indexing of anything I've thrown at it so far, including PDF's and the new MS Office files. If there are problems with some formats, I believe they mainly lie in getting formatted text extraction rather than just raw plaintext.
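Once Tika has handed you the plain text, indexing it with Lucene is only a few more lines. A sketch against a recent Lucene API (the field names and index directory are made up; file and text are the variables from the snippet above):
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Add the extracted text to a disk-based index so it becomes searchable
try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("search-index")),
        new IndexWriterConfig(new StandardAnalyzer()))) {
    Document doc = new Document();
    doc.add(new TextField("filename", file.getName(), Field.Store.YES)); // stored for display
    doc.add(new TextField("content", text, Field.Store.NO));             // indexed only
    writer.addDocument(doc);
}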
Just for updating:
There is another alternative to Solr called ElasticSearch. It's a project with good capabilities, similar to Solr, but schemaless.
Both projects are built on top of Lucene.

A good library for converting PDF to TIFF? [closed]

I need a Java library to convert PDFs to TIFF images. The PDFs are faxes, and I will be converting to TIFF so that I can then do barcode recognition on the image. Can anyone recommend a good free open source library for conversion from PDF to TIFF?
I can't recommend any code library, but it's easy to use Ghostscript to convert PDF into bitmap formats. I've personally used the script below (which also uses the netpbm utilities) to convert the first page of a PDF into a JPEG thumbnail:
#!/bin/sh
/opt/local/bin/gs -q -dLastPage=1 -dNOPAUSE -dBATCH -dSAFER -r300 \
-sDEVICE=pnmraw -sOutputFile=- $* |
pnmcrop |
pnmscale -width 240 |
cjpeg
You can use -sDEVICE=tiff... to get direct TIFF output in various TIFF sub-formats from GhostScript.
Disclaimer: I work for Atalasoft
We have an SDK that can convert PDF to TIFF. The rendering is powered by Foxit software which makes a very powerful and efficient PDF renderer.
We also do PDF -> G3 TIFF conversion here, with high and low res. From my experience the best tool you can get is the Adobe PDF SDK; the only problem with it is its insane price, so we don't use it.
What works fine for us is Ghostscript; the latest versions are pretty robust and correctly render the majority of PDFs, and we have quite a few of them coming in during the day. In production the conversion is done using gsdll32.dll, but if you want to try it, use the following command line:
gswin32c -dNOPAUSE -dBATCH -dMaxStripSize=8192 -sDEVICE=tiffg3 -r204x196 -dDITHERPPI=200 -sOutputFile=test.tif prefix.ps test.pdf
It will convert your PDF into a high-res G3 TIFF. The prefix.ps code is here:
<< currentpagedevice /InputAttributes get
0 1 2 index length 1 sub {1 index exch undef } for
/InputAttributes exch dup 0 <</PageSize [0 0 612 1728]>> put
/Policies << /PageSize 3 >> >> setpagedevice
Another thing about this SDK is that it's open source; you get both the C and PS (PostScript) source code for it. Also, if you go with another tool, check what kind of engine they use to power the PDF rendering; it may turn out they use Ghostscript, as for instance LeadTools does.
Hope this helps. Regards
You can use the icepdf library (Apache 2.0 License).
They even provide this exact use case as one of their example source code:
http://wiki.icesoft.org/display/PDF/Multi-page+Tiff+Capture
Maybe it is not necessary to convert the PDF into TIFF. The fax will most likely be an embedded image in the PDF, so you could just extract these images again. That should be possible with the already-mentioned iText library.
I don't know if this is easier than the other approach.
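A rough sketch of that extraction using iText's low-level API (a hypothetical example: the file name is made up, and raw CCITT fax data pulled out this way still needs to be wrapped in a TIFF container before a barcode library can read it):
import com.lowagie.text.pdf.PRStream;
import com.lowagie.text.pdf.PdfName;
import com.lowagie.text.pdf.PdfObject;
import com.lowagie.text.pdf.PdfReader;

PdfReader reader = new PdfReader("fax.pdf");
for (int i = 0; i < reader.getXrefSize(); i++) {
    PdfObject obj = reader.getPdfObject(i);
    if (obj == null || !obj.isStream()) {
        continue;
    }
    PRStream stream = (PRStream) obj;
    // image XObjects carry /Subtype /Image
    if (PdfName.IMAGE.equals(stream.get(PdfName.SUBTYPE))) {
        byte[] raw = PdfReader.getStreamBytesRaw(stream); // still compressed (e.g. CCITT)
        // write 'raw' out, wrapped in the proper image container
    }
}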
Take a look at Apache PDFBox - A Java PDF Library
No, iText cannot convert PDFs to TIFF.
However, there are commercial libraries that can do that. jPDFImages is a 100% Java library that can convert PDF to images in TIFF, JPEG or PNG formats (and maybe JBIG? I am not sure). It can also do the reverse, creating PDFs from images. It starts at $300 for a server.
Here is a good article with wrapper classes for using Ghostscript from C#/.NET... I ended up using this in production:
http://www.codeproject.com/KB/cs/GhostScriptUseWithCSharp.aspx
I have had great experience with iText (currently using version 5.0.6), and this is the code for TIFF-to-PDF conversion:
private static String convertTiff2Pdf(String tiff) {
    // target PDF path: same name as the TIFF, with a .pdf extension
    String pdf = null;
    try {
        pdf = tiff.substring(0, tiff.lastIndexOf('.') + 1) + "pdf";
        // New document, LETTER page size, no margins
        Document document = new Document(PageSize.LETTER, 0, 0, 0, 0);
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(pdf));
        int pages = 0;
        document.open();
        PdfContentByte cb = writer.getDirectContent();
        RandomAccessFileOrArray ra = new RandomAccessFileOrArray(tiff);
        int comps = TiffImage.getNumberOfPages(ra);
        // Conversion loop: one PDF page per TIFF page
        for (int c = 0; c < comps; ++c) {
            Image img = TiffImage.getTiffImage(ra, c + 1);
            if (img != null) {
                System.out.println("page " + (c + 1));
                // scale to 72 dpi page units and size the page to the image
                img.scalePercent(7200f / img.getDpiX(), 7200f / img.getDpiY());
                document.setPageSize(new Rectangle(img.getScaledWidth(), img.getScaledHeight()));
                img.setAbsolutePosition(0, 0);
                cb.addImage(img);
                document.newPage();
                ++pages;
            }
        }
        ra.close();
        document.close();
    } catch (Exception e) {
        logger.error("Convert fail");
        logger.debug("", e);
        pdf = null;
    }
    logger.debug("[" + tiff + "] -> [" + pdf + "] OK");
    return pdf;
}
