Converting HTML files into PDF - java

I am using the following code to generate a PDF file from an HTML report:
String url = new File("Test.html").toURI().toURL().toString();
OutputStream os = new FileOutputStream("Test.pdf");
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(url);
renderer.layout();
renderer.createPDF(os);
os.close();
I was able to convert sample HTML files to PDF with it. But in my real use case, the HTML content contains special characters such as &, < and > that can't be parsed as XML.
I tried using CDATA while generating the HTML itself, but later found that the text around the CDATA is not visible in the HTML.
Does anyone have a solution for this?

Have you tried printing to PDF from the browser? Search for PrimoPDF, a program that will let you do it.

I don't know if this will help you, but you can use StringEscapeUtils from Apache Commons. It has methods for escaping and unescaping HTML (you may use them to pre-process your HTML before PDF generation).
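If adding Apache Commons is not an option, the same pre-processing idea can be sketched by hand. The helper below is only an illustration: the class name XhtmlText and the method escape are made up for this example and are not part of Flying Saucer or Apache Commons. It escapes the five characters that are special in XML, so that arbitrary report text can be placed inside an XHTML document before it goes to the renderer. It is meant for the text values only, not for the markup itself.

```java
// Hypothetical helper for illustration; not part of any library.
final class XhtmlText {

    private XhtmlText() {
    }

    // Replaces the characters that an XML parser would treat as markup.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cell = "a < b && c > d";
        // Only the text content is escaped; the surrounding tags are written as-is.
        System.out.println("<td>" + escape(cell) + "</td>");
    }
}
```

Escaping at the point where the values are written into the HTML avoids the need for CDATA sections altogether.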

Related

Adding pages to PDF/A file with PDFBox without losing PDF/A validity

I'm developing a Java application that has to process a folder with PDF/A files, adding a page with some information to each of them using Apache's PDFBox library. The problem is that the output PDF file after adding the information is not PDF/A anymore. This is a validation test from the website: https://www.pdf-online.com/osa/validate.aspx:
And this is the relevant part of the code that I use to generate the PDF file:
String pdfFileName = this.baseFolder+this.extendedPDFFileName;
File file = new File(pdfFileName);
PDDocument pdfFile = PDDocument.load(file);
PDPage pag = new PDPage();
// As a test, simply adding a page makes the PDF invalid as PDF/A
pdfFile.addPage(pag);
pdfFile.save(file);
pdfFile.close();
What could I do to keep the PDF/A format validity? Thanks in advance,
As Tilman Hausherr suggested, the problem has been solved by adding a PDResources object to the new page, like this:
pag.setResources(new PDResources());
Now I'm having trouble with the embedded fonts, but that is another question :)
Many thanks!
Your code creates a normal PDF; you should create a valid PDF/A from the start.
Here's a link: https://pdfbox.apache.org/1.8/cookbook/pdfacreation.html

How to write data to pdf file which contains html tags using itext lib in Java

I have a String that contains some HTML tags and comes from a database. I want to write it to a PDF file with the styling that the HTML tags in the String describe. I tried to use XMLWorkerHelper like this:
String html = "What is the equation of the line passing through the "
        + "point (2,-3) and making an angle of -45<sup>2</sup> with the positive "
        + "X-axis?";
XMLWorkerHelper.getInstance().parseXHtml(writer, document, new StringReader(html));
But it only renders the text that is inside an HTML tag (in this case only the 2) and simply ignores the rest of the string. I want the entire String with its HTML formatting.
With HTMLWorker it works perfectly, but that is deprecated, so please let me know how to achieve this.
I am using the iText 5 library.

Convert html to pdf automatically using ASPOSE in Java

I am trying to convert HTML to PDF using Aspose. Setting the page size explicitly (A1, A2, A3, A4) works perfectly, but I don't want to set a page size for PDF generation. So far I have tried the code below:
HtmlLoadOptions htmloptions = new HtmlLoadOptions(basePath);
htmloptions.getPageInfo().setWidth(PageSize.getA2().getWidth());
htmloptions.getPageInfo().setHeight(PageSize.getA2().getHeight());
// Load HTML file
Document doc = new Document(basePath + "400010_DOC002_L_10_2508016.html", htmloptions);
// Save the document as PDF
doc.save("D:/Web+URL_output.pdf");
Can anyone suggest how to convert HTML to PDF without setting a page size? Otherwise, please let me know what other tools are available for this conversion.
@Shankar, you may use the code sample below to convert an HTML file to a PDF file without setting a page size. By default, the rendered PDF file uses the A4 page size.
Simply omit the code that sets the page size; the rest remains the same.
HtmlLoadOptions htmloptions = new HtmlLoadOptions(basePath);
// Load HTML file
Document doc = new Document(basePath + "400010_DOC002_L_10_2508016.html", htmloptions);
// Save the document as PDF
doc.save("D:/Web+URL_output.pdf");
Please let us know if you need any further assistance. I work with Aspose as Developer Evangelist.

JAXB: can not read Japanese characters properly

I have a program that supports internationalisation, and some entries contain Japanese characters. When I export such an entry to XML using JAXB, the Japanese characters look fine in the file. The problem appears when I unmarshal that XML file to get the data back as a Java object: I do not get the proper unmarshalled value of the Japanese characters.
Here is my marshalling code:
OutputStreamWriter outputWriter = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
JAXB.marshal(xmlobj, outputWriter);
Unmarshalling code:
InputStreamReader inputReader = new InputStreamReader(xml, "UTF-8");
xmlobj = JAXB.unmarshal(inputReader, <JAVA_CLASS_TO_UNMARSHAL>);
The text I am marshalling-unmarshalling is: 説明_1
The text displays correctly when I fetch the record and show it in the browser, but after JAXB unmarshalling an incorrect value is displayed. After converting it to HTML-compatible code I got the value &#35500;&#26126;_1, which is actually the correct conversion of the Japanese characters and should appear as the proper characters in the browser, but it does not. The browser displays the HTML codes &#35500;&#26126;_1 instead.
Any idea what I am doing wrong?
If the HTML contains
<html>
<body>
説明_1<br>
</body>
</html>
then a good browser like Firefox (I have 31.0) should display 説明_1. Can you add the HTML section to your question?
If your browser isn't fit to display these characters, you should see something like .
You report that you see &#35500;&#26126;_1, which is possible if your HTML text contains
&amp;#35500;&amp;#26126;_1<br>
which would mean that the transformation to HTML hasn't worked correctly.
Once more: check your HTML code, and how it is produced from the XML.
Try using UTF-8 in your HTML header. Note that just changing the charset in the header won't convert the content; you need to make sure that the content is actually UTF-8 as well.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
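To see why the declared charset and the actual bytes have to agree, here is a small sketch that uses only the JDK (the class name CharsetRoundTrip is made up for this illustration). It encodes the text as UTF-8 and then decodes the same bytes twice, once as UTF-8 and once as ISO-8859-1. Only the first decoding gives the original text back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are for illustration only.
final class CharsetRoundTrip {

    private CharsetRoundTrip() {
    }

    // Decodes raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The sample text from the question, written with Unicode escapes
        // so the source file itself does not depend on any encoding.
        String text = "\u8AAC\u660E_1";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        String sameCharset = decode(utf8, StandardCharsets.UTF_8);
        String wrongCharset = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip intact: " + sameCharset.equals(text));
        System.out.println("ISO-8859-1 decoding intact: " + wrongCharset.equals(text));
    }
}
```

The same mismatch happens in a browser when the meta tag says UTF-8 but the file was saved in another encoding, or the other way round.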
The comment by Wundwin Born solved the issue: I forgot to unescape the string.
Here is the code snippet:
org.apache.commons.lang.StringEscapeUtils.unescapeHtml(xmlString);
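For anyone curious what the unescaping step does with this particular input, here is a rough sketch in plain Java. It is not how StringEscapeUtils is implemented; the class NumericEntities and its decode method are hypothetical names for this example, and it only handles decimal references such as &#35500; (no named entities like &amp;amp; and no hexadecimal form).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; handles decimal numeric character references only.
final class NumericEntities {

    // Matches references such as &#35500; and captures the digits.
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    private NumericEntities() {
    }

    static String decode(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave the reference untouched if it is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("&#35500;&#26126;_1");
        // Compare against the expected text given as Unicode escapes.
        System.out.println(decoded.equals("\u8AAC\u660E_1"));
    }
}
```

In real code the library call above is the better choice, since it also covers named entities.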

docx4j convert docx in wrong html format

I have some problems with the docx4j samples. I need to convert a file from docx to HTML and back. I am trying to run the ConvertInXHTMLDocument.java sample. It creates the HTML file fine, but when I try to convert it back into docx, it throws an exception saying that close tags are missing (META, img, etc.). Has anyone encountered this problem?
XHTMLImporter requires its input to be well-formed XML. So you need to ensure you don't have missing close tags (META, img etc); if you do, run JTidy or similar first.
docx4j's (X)HTML output can either be HTML or XML. From 3.0, the property Convert.Out.HTML.OutputMethodXML will control which.
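A quick way to find out whether a given HTML string would pass the well-formedness requirement is to run it through the JDK's own XML parser first. The sketch below is independent of docx4j; WellFormedCheck and isWellFormed are hypothetical names, and it assumes the input has no DOCTYPE that points at an external DTD (otherwise the parser may try to fetch it).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical pre-flight check; not part of docx4j.
final class WellFormedCheck {

    private WellFormedCheck() {
    }

    // Returns true if the markup parses as well-formed XML.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser
            // from printing its own messages to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Unclosed tags such as <img> or <META> end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String closed = "<html><body><img src=\"a.png\"/></body></html>";
        String unclosed = "<html><body><img src=\"a.png\"></body></html>";
        System.out.println(isWellFormed(closed));
        System.out.println(isWellFormed(unclosed));
    }
}
```

If the check fails, that is the point at which to run JTidy or a similar tool before handing the content to XHTMLImporter.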
