PDF file generation from XML or HTML [closed] - java

Is there an API or solution to generate a PDF report from XML data and a layout definition?
For example the XML definition/data could be:
<pdf>
<paragraph font="Arial">Title of report</paragraph>
</pdf>
Converting HTML to PDF would also be a good solution, I feel.
Currently we write Java code using the iText API. I want to externalize the layout so that a non-technical person can edit it and make changes.

Have a look at Apache FOP. Use an XSLT stylesheet to convert the XML (or XHTML) into XSL-FO. Then use FOP to read the XSL-FO document and format it to a PDF document (see Hello World with FOP).
Apache FOP can use a lot of memory for large documents (e.g., a 200-page PDF), which may require tweaking the JVM memory settings.
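The XSLT half of that pipeline needs nothing beyond the JDK. Here is a minimal sketch using only javax.xml.transform; the report XML matches the `<pdf>`/`<paragraph>` example from the question, and the stylesheet is an invented minimal mapping, not FOP's own:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XmlToFo {
    public static void main(String[] args) throws Exception {
        // Hypothetical report XML, mirroring the <pdf>/<paragraph> example above
        String xml = "<pdf><paragraph font=\"Arial\">Title of report</paragraph></pdf>";

        // Minimal stylesheet mapping <paragraph> to an XSL-FO block
        String xslt =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
            + " xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
            + "<xsl:template match=\"/pdf\">"
            + "<fo:root>"
            + "<fo:layout-master-set>"
            + "<fo:simple-page-master master-name=\"page\"><fo:region-body/></fo:simple-page-master>"
            + "</fo:layout-master-set>"
            + "<fo:page-sequence master-reference=\"page\">"
            + "<fo:flow flow-name=\"xsl-region-body\"><xsl:apply-templates/></fo:flow>"
            + "</fo:page-sequence>"
            + "</fo:root>"
            + "</xsl:template>"
            + "<xsl:template match=\"paragraph\">"
            + "<fo:block font-family=\"{@font}\"><xsl:value-of select=\".\"/></fo:block>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter fo = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(fo));
        System.out.println(fo); // XSL-FO, ready to hand to FOP
    }
}
```

Feeding the resulting XSL-FO to FOP is then a few more lines with FopFactory, as the Hello World example linked above shows.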

iText has a facility for generating PDFs from XML (and HTML, I think). Here is the DTD, but I found it difficult to sort out; beyond that, I never found any good documentation on what is supported. My approach was to look at the source for SAXiTextHandler and ElementTags to figure out what was acceptable. Although not ideal, it is pretty straightforward.
<itext orientation="portrait" pagesize="LETTER" top="36" bottom="36" left="36" right="36" title="My Example" subject="My Subject" author="Me">
<paragraph size="8" >This is an example</paragraph>
</itext>
...
import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.PdfWriter;
import com.lowagie.text.xml.SAXiTextHandler;
...
String inXml = ""; //use xml above as an example
ByteArrayOutputStream temp = new ByteArrayOutputStream();
Document document = new Document();
PdfWriter writer = null;
try
{
writer = PdfWriter.getInstance(document, temp);
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
parser.parse(new ByteArrayInputStream(inXml.getBytes()), new SAXiTextHandler(document));
}
catch (Exception e)
{
// instead, catch the proper exception and do something meaningful
e.printStackTrace();
}
finally
{
if (writer != null)
{
try
{
writer.close();
}
catch (Exception ignore)
{
// ignore
}
} // if
}
//temp holds the PDF

Take a look at JasperReports; it uses iText to export files, I think, and its IDE is simple enough to be used by non-programmers.
Edit: I forgot to mention, you can use the JasperReports engine directly in your application, or you can use iReport, the "Designer for JasperReports".

You will want to use a well supported XML format for this, as it will allow you to leverage the work of others.
A well supported XML format is DocBook XML - http://www.docbook.org/ - and this - http://sagehill.net/docbookxsl/index.html - appears to be a good resource on doing the XML -> PDF using XSLT with the Docbook style sheets and other formats.
This approach allows you to use any XSLT processor and any XSL/FO processor to get to your result. This gives you easy scriptability as well as the freedom to switch implementations if needed - notably older Apache FOP implementations degraded badly when the resulting PDF got "too large".

Prince is one of the best tools out there. It uses CSS for styling, so if you appreciate that way of separating data from presentation (read: your users can work that way too), it may be a very good fit for you. (By comparison, the control over display that browsers offer through CSS is primitive.)


Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app that converts Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (i.e. color, font, text size, hyperlinks, etc.).
I want the final docx to keep the same formatting as the original after converting words from Unicode to the other font.
Here is my Code
try {
fileInputStream = new FileInputStream("StartDoc.docx");
document = new XWPFDocument(fileInputStream);
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
List<XWPFParagraph> paragraph = document.getParagraphs();
Converter data = new Converter() ;
for(XWPFParagraph p :document.getParagraphs())
{
for(XWPFRun r :p.getRuns())
{
String string2 = r.getText(0);
string2 = data.uniToShree(string2);
r.setText(string2, 0);
}
}
//Write the Document in file system
FileOutputStream out = new FileOutputStream(new File("Output.docx"));
document.write(out);
out.close();
System.out.println("Output.docx written successfully");
}
catch (IOException e) {
System.out.println("We had an error while reading the Word Doc");
}
Thanks for the question.
I worked with POI some years ago, though on Excel workbooks; still, I'll try to help you find the root cause of your error.
The Java runtime gives you good debugging information by itself.
A good first step in disambiguating the error is to not overwrite the exception message provided to you.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Printing the stack trace with printStackTrace() is also often useful to pinpoint where the error lies.
Share your findings from the above calls to help us help you debug the issue.
[EDIT 1:]
So it seems you are able to convert the fonts in the content just fine, but you are not able to reconstruct the formatting of the original data in the converted file.
(Thus the message "We had an error while reading the Word Doc" that gets printed is misleading. ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data as you are working only on the content of your respective doc files.
In order to be able to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that specify the formatting rules. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file you want quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on those provided by Microsoft's namespaces. Doing this programmatically requires getting versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), but that route is scarcely less complex than writing your own version of MS Word; alternatively, you can use MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
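Since a .docx is just a ZIP container of XML parts, you can inspect the schema-bearing XML described in [1] with nothing but java.util.zip. A sketch (the in-memory container and its stripped-down XML content are fabricated stand-ins for a real file, which would carry full w: namespace declarations):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxPeek {
    public static void main(String[] args) throws Exception {
        // Build a toy "docx" in memory: one XML part, standing in for a real file
        ByteArrayOutputStream container = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(container)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(("<w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
                     + "</w:body></w:document>").getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }

        // Walk the container and pull out the main document part
        String documentXml = null;
        try (ZipInputStream zip = new ZipInputStream(
                new ByteArrayInputStream(container.toByteArray()))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    documentXml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        System.out.println(documentXml);
    }
}
```

With a real file you would replace the in-memory container with `new ZipInputStream(new FileInputStream("StartDoc.docx"))`; the run properties (w:rPr) around each w:r element are what carry the formatting you want to preserve.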
My answer gives you a cursory overview of how to achieve what you want; depending on your inclination and available time, use your discretion before deciding which path to take.
Hope it helps!

High performance HTML parsing library [duplicate]

I'm working on an app that scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking each line of HTML for
div class = "classname"
This works, but I can't help feeling there is a better solution out there.
Is there any nice way where I could give a class a line of HTML and have some nice methods like:
boolean usesClass(String CSSClassname);
String getText();
String getLink();
Another library that might be useful for HTML processing is jsoup.
jsoup tries to clean malformed HTML and allows HTML parsing in Java using a jQuery-like selector syntax.
http://jsoup.org/
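For the question's use case, the jsoup version is roughly as follows (the HTML snippet and class name here are invented for the sketch):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo {
    public static void main(String[] args) {
        String html = "<div class=\"classname\"><a href=\"http://example.com\">link text</a></div>"
                    + "<div class=\"other\">ignored</div>";
        Document doc = Jsoup.parse(html);            // tolerant of malformed HTML
        Elements divs = doc.select("div.classname"); // jQuery-style CSS selector
        for (Element div : divs) {
            System.out.println(div.text());                  // the text content
            System.out.println(div.select("a").attr("href")); // the link
        }
    }
}
```

That covers all three methods asked for: the selector replaces usesClass(), text() gives the text, and select("a").attr("href") gives the link.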
The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you get the XML (XHTML), there are plenty of tools to handle it. You could use a simple SAX handler that extracts only the data you need, or any tree-based approach (DOM, JDOM, etc.) that even lets you modify the original code.
Here is sample code that uses HtmlCleaner to get all DIVs that use a certain class and print the text content inside them.
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
/**
* @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
*/
public class TestHtmlParse
{
static final String className = "tags";
static final String url = "http://www.stackoverflow.com";
TagNode rootNode;
public TestHtmlParse(URL htmlPage) throws IOException
{
HtmlCleaner cleaner = new HtmlCleaner();
rootNode = cleaner.clean(htmlPage);
}
List getDivsByClass(String CSSClassname)
{
List divList = new ArrayList();
TagNode divElements[] = rootNode.getElementsByName("div", true);
for (int i = 0; divElements != null && i < divElements.length; i++)
{
String classType = divElements[i].getAttributeByName("class");
if (classType != null && classType.equals(CSSClassname))
{
divList.add(divElements[i]);
}
}
return divList;
}
public static void main(String[] args)
{
try
{
TestHtmlParse thp = new TestHtmlParse(new URL(url));
List divs = thp.getDivsByClass(className);
System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
for (Iterator iterator = divs.iterator(); iterator.hasNext();)
{
TagNode divElement = (TagNode) iterator.next();
System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
}
}
catch(Exception e)
{
e.printStackTrace();
}
}
}
Several years ago I used JTidy for the same purpose:
http://jtidy.sourceforge.net/
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
More information on JTidy can be found on the JTidy SourceForge project page."
You might be interested by TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well formed XHTML.
The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:
Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter =
new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);
Jericho: http://jericho.htmlparser.net/docs/index.html
Easy to use, supports malformed HTML, lots of examples.
HTMLUnit might be of help. It does a lot more stuff too.
http://htmlunit.sourceforge.net/
Let's not forget Jerry, it's jQuery in Java: a fast and concise Java library that simplifies HTML document parsing, traversing and manipulating; it includes support for CSS3 selectors.
Example:
Jerry doc = jerry(html);
doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");
Example:
doc.form("#myform", new JerryFormHandler() {
public void onForm(Jerry form, Map<String, String[]> parameters) {
// process form and parameters
}
});
Of course, these are just some quick examples to give you a feeling for how it all looks.
The nu.validator project is an excellent, high performance HTML parser that doesn't cut corners correctness-wise.
The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)
You can also use XWiki HTML Cleaner:
It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.
If your HTML is well-formed, you can easily employ an XML parser to do the job for you... and if you're only reading, SAX would be ideal.
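For the well-formed case, the whole job fits in the JDK: parse with DocumentBuilder and pick out the divs with an XPath expression. A sketch (markup and class name invented for the demo):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XhtmlSelect {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                     + "<div class=\"classname\">first</div>"
                     + "<div class=\"other\">second</div>"
                     + "</body></html>";
        // Parse the well-formed XHTML into a DOM tree
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        // Select all divs carrying the target class attribute
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList divs = (NodeList) xpath.evaluate(
                "//div[@class='classname']", doc, XPathConstants.NODESET);
        for (int i = 0; i < divs.getLength(); i++) {
            System.out.println(divs.item(i).getTextContent());
        }
    }
}
```

This breaks the moment the input isn't valid XML, which is exactly why the cleaners above (JTidy, TagSoup, HtmlCleaner) exist: run the page through one of them first, then apply this kind of XPath selection to the result.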

how to create an odt file programmatically with java?

How can I create an odt (LibreOffice/OpenOffice Writer) file with Java programmatically? A "hello world" example will be sufficient. I looked at the OpenOffice website but the documentation wasn't clear.
Take a look at ODFDOM - the OpenDocument API
ODFDOM is a free OpenDocument Format (ODF) library. Its purpose is to provide an easy, common way to create, access and manipulate ODF files, without requiring detailed knowledge of the ODF specification. It is designed to provide the ODF developer community with an easy, lightweight programming API portable to any object-oriented language.
The current reference implementation is written in Java.
// Create a text document from a standard template (empty documents within the JAR)
OdfTextDocument odt = OdfTextDocument.newTextDocument();
// Append text to the end of the document.
odt.addText("This is my very first ODF test");
// Save document
odt.save("MyFilename.odt");
Later:
As of this writing (2016-02), we are told that these classes are deprecated... big time, and the OdfTextDocument API documentation tells you:
As of release 0.8.8, replaced by org.odftoolkit.simple.TextDocument in
Simple API.
This means you still include the same active .jar file in your project, simple-odf-0.8.1-incubating-jar-with-dependencies.jar, but you want to be unpacking the following .jar to get the documentation: simple-odf-0.8.1-incubating-javadoc.jar, rather than odfdom-java-0.8.10-incubating-javadoc.jar.
Incidentally, the documentation link downloads a bunch of jar files inside a .zip which says "0.6.1"... but most of the stuff inside appears to be more like 0.8.1. I have no idea why they say "as of 0.8.8" in the documentation for the "deprecated" classes: just about everything is already marked deprecated.
The equivalent simple code to the above is then:
odt_doc = org.odftoolkit.simple.TextDocument.newTextDocument()
para = odt_doc.getParagraphByIndex( 0, False )
para.appendTextContent( 'stuff and nonsense' )
odt_doc.save( 'mySpankingNewFile.odt' )
P.S. I am using Jython, but the Java equivalent should be obvious.
I have not tried it, but using JOpenDocument may be an option. (It seems to be a pure Java library to generate OpenDocument files.)
A complement of previously given solutions would be JODReports, which allows creating office documents and reports in ODT format (from templates, composed using the LibreOffice/OpenOffice.org Writer word processor).
DocumentTemplateFactory templateFactory = new DocumentTemplateFactory();
DocumentTemplate template = templateFactory.getTemplate(new File("template.odt"));
Map<String, Object> data = new HashMap<String, Object>();
data.put("title", "Title of my doc");
data.put("picture", new RenderedImageSource(ImageIO.read(new File("/tmp/lena.png"))));
data.put("answer", "42");
//...
template.createDocument(data, new FileOutputStream("output.odt"));
Optionally the documents can then be converted to PDF, Word, RTF, etc. with JODConverter.
Edit/update
Here you can find a sample project using JODReports (with non-trivial formatting cases).
I have written a jruby DSL for programmatically manipulating ODF documents.
https://github.com/noah/ocelot
It's not strictly java, but it aims to be much simpler to use than the ODFDOM.
Creating a hello world document is as easy as:
% cat examples/hello.rb
include OCELOT
Text::create "hello" do
paragraph "Hello, world!"
end
There are a few more examples (including a spreadsheet example or two) here.
I have been searching for an answer to this question myself. I am working on a project for generating documents in different formats and was in bad need of a library to generate ODT files.
I finally can say that the ODF Toolkit with the latest version of the simple-odf library is the answer for generating text documents.
You can find the official page here:
Apache ODF Toolkit (Incubating) - Simple API
Here is a page to download version 0.8.1 (the latest version of the Simple API), as I didn't find it on the official page, which only lists version 0.6.1.
And here you can find Apache ODF Toolkit (incubating) cookbook
You can try using JasperReports to generate your reports, then export to ODS. The nice things about this approach are:
- you get broad support for all JasperReports output formats, e.g. PDF, XLS, HTML, etc.
- Jasper Studio makes it easy to design your reports
The ODF Toolkit project (code hosted at GitHub) is the new home of the former ODFDOM project, which was until 2018-11-27 an Apache Incubator project.
Another solution may be the JODF Java API from the Independentsoft company.
For example, to create an OpenDocument file with this API we could do the following:
import com.independentsoft.office.odf.Paragraph;
import com.independentsoft.office.odf.TextDocument;
public class Example {
public static void main(String[] args)
{
try
{
TextDocument doc = new TextDocument();
Paragraph p1 = new Paragraph();
p1.add("Hello World");
doc.getBody().add(p1);
doc.save("c:\\test\\output.odt", true);
}
catch (Exception e)
{
System.out.println(e.getMessage());
e.printStackTrace();
}
}
}
There are also .NET solutions for this API.

With Java: replace string in MS Word file [closed]

We need a Java library to replace strings in MS Word files.
Can anyone suggest one?
While there is MS Word support in Apache POI, it is not very good: loading and then saving any file with more than the most basic formatting will likely garble the layout. You should try it out, though; maybe it works for you.
There are a number of commercial libraries as well, but I don't know if any of them are any better.
The crappy "solution" I had to settle for when working on a similar requirement recently was to use the DOCX format, open the ZIP container, read the document XML, and replace my markers with the right text. This works for replacing simple bits of text without paragraphs, etc.
private static final String WORD_TEMPLATE_PATH = "word/word_template.docx";
private static final String DOCUMENT_XML = "word/document.xml";
/*....*/
final Resource templateFile = new ClassPathResource(WORD_TEMPLATE_PATH);
final ZipInputStream zipIn = new ZipInputStream(templateFile.getInputStream());
final ZipOutputStream zipOut = new ZipOutputStream(output);
ZipEntry inEntry;
while ((inEntry = zipIn.getNextEntry()) != null) {
final ZipEntry outEntry = new ZipEntry(inEntry.getName());
zipOut.putNextEntry(outEntry);
if (inEntry.getName().equals(DOCUMENT_XML)) {
final String contentIn = IOUtils.toString(zipIn, UTF_8);
final String outContent = this.processContent(new StringReader(contentIn));
IOUtils.write(outContent, zipOut, UTF_8);
} else {
IOUtils.copy(zipIn, zipOut);
}
zipOut.closeEntry();
}
zipIn.close();
zipOut.finish();
I'm not proud of it, but it works.
I would suggest the Apache POI library:
http://poi.apache.org/
Looking more, it seems it hasn't been kept up to date. Boo! It may be complete enough now to do what you need, however.
Try this one: http://www.dancrintea.ro/doc-to-pdf/
Besides replacing strings in MS Word files, it can also:
- read/write Excel files using simplified API like: getCell(x,y) and setCell(x,y,string)
- hide Excel sheets(secondary calculations for example)
- replace images in DOC, ODT and SXW files
- and convert:
doc --> pdf, html, txt, rtf
xls --> pdf, html, csv
ppt --> pdf, swf
I would take a look at the Apache POI project. This is what I have used to interact with MS documents in the past.
http://poi.apache.org/
Thanks, all. I am going to try http://www.dancrintea.ro/doc-to-pdf/
because I need to convert classic DOC files (binary), not DOCX (ZIP format).

What is the best approach to implement search for searching documents (PDF, XML, HTML, MS Word)?

What would be a good way to implement search functionality for documents in a Java web application?
Is 'tagged search' a good fit for this kind of feature?
Why re-invent the wheel?
Check out Apache Lucene.
Also, search Stack Overflow for "full text search" and you'll find a lot of other very similar questions. Here's another one, for example:
How do I implement Search Functionality in a website?
You could use Solr, which sits on top of Lucene and is a real web search engine application, while Lucene is a library. However, neither Solr nor Lucene parses Word documents, PDFs, etc. to extract their content; you need to extract the text yourself and index the document according to a pre-defined document schema.
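The index-then-search cycle with plain Lucene looks roughly like this (field name and sample text are invented; the API is shown as in Lucene 8.x/9.x, and details such as the in-memory Directory class vary by version):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        Directory index = new ByteBuffersDirectory(); // in-memory index

        // Index one "document" -- in practice the text would come from
        // an extractor such as Tika, mentioned below
        try (IndexWriter writer = new IndexWriter(index,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("content",
                    "annual report with revenue figures", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search it
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", new StandardAnalyzer())
                    .parse("revenue");
            TopDocs hits = searcher.search(query, 10);
            System.out.println(hits.scoreDocs.length + " hit(s)");
        }
    }
}
```

Solr and Elasticsearch wrap essentially this machinery behind an HTTP API, which is why the extraction step still has to happen before indexing either way.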
As for extracting the text content of Office documents (which you need to do before giving it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.
Using Tika, the code to get the text from a file is quite simple:
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
// exception handling not shown
Parser parser = new AutoDetectParser();
StringWriter textBuffer = new StringWriter();
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());
parser.parse(input, new BodyContentHandler(textBuffer), md);
String text = textBuffer.toString();
So far, Tika 0.3 seems to work great. Just throw any file at it and it will give you back what makes the most sense for that format. I have been able to get the text for indexing out of everything I've thrown at it so far, including PDFs and the new MS Office files. If there are problems with some formats, I believe they mainly lie in getting formatted text extraction rather than just raw plaintext.
Just for updating:
There is another alternative to Solr called Elasticsearch; it's a project with good capabilities, similar to Solr, but schemaless.
Both projects are built on top of Lucene.
