How to convert HTML to RTF in Java? - java

(I'm looking for a open source library)

iText I believe has an RTF capability as well as pdf.
http://itextpdf.com/
http://www.java-tips.org/other-api-tips/itext/creating-pdf-rtf-or-document-from-a-java-class-at-ru-2.html

You can convert HTML to RTF using basic Java APIs RTFEditorKit and HTMLEditorKit.
It is not converting new line tags like <br/> and <p> to new line character equivalent in RTF. I have applied external fix for that as shown in following Java code.
private static String convertToRTF(String htmlStr) {
OutputStream os = new ByteArrayOutputStream();
HTMLEditorKit htmlEditorKit = new HTMLEditorKit();
RTFEditorKit rtfEditorKit = new RTFEditorKit();
String rtfStr = null;
htmlStr = htmlStr.replaceAll("<br.*?>","#NEW_LINE#");
htmlStr = htmlStr.replaceAll("</p>","#NEW_LINE#");
htmlStr = htmlStr.replaceAll("<p.*?>","");
InputStream is = new ByteArrayInputStream(htmlStr.getBytes());
try {
Document doc = htmlEditorKit.createDefaultDocument();
htmlEditorKit.read(is, doc, 0);
rtfEditorKit .write(os, doc, 0, doc.getLength());
rtfStr = os.toString();
rtfStr = rtfStr.replaceAll("#NEW_LINE#","\\\\par ");
} catch (IOException e) {
e.printStackTrace();
} catch (BadLocationException e) {
e.printStackTrace();
}
return rtfStr;
}
Here, I am replacing new line equivalent HTML tags to some special string and replacing back to new line representation chars sequence \par in RTF.
If you want to use more effective APIs and you have valid html, you should explore Apache-FOP.
Apache FOP can be used to convert to RTF. Following are some useful links -
http://www.torsten-horn.de/techdocs/java-xsl.htm#XSL-FO-Java
http://html2fo.sourceforge.net/index.html

Related

PDFBox Open PDF file into new browser tab

I am using the pdfbox library 2.0 version. I need to open PDF in new browser tab i.e. Print View.
As if we are migrating from iText to PDFBox below is the existing code with iText.
With below code, there is PDFAction class to achieve same. It is,
PdfAction action = new PdfAction(PdfAction.PRINTDIALOG);
and to apply print Javascript on doc,
copy.addJavaScript(action);
I need equivalent solution with PDFBox.
Document document = new Document();
try{
outputStream=response.getOutputStream();
// step 2
PdfCopy copy = new PdfCopy(document, outputStream);
// step 3
document.open();
// step 4
PdfReader reader;
int n;
//add print dialog in Pdf Action to open file for preview.
PdfAction action = new PdfAction(PdfAction.PRINTDIALOG);
// loop over the documents you want to concatenate
Iterator i=mergepdfFileList.iterator();
while(i.hasNext()){
File f =new File((String)i.next());
is=new FileInputStream(f);
reader=new PdfReader(is);
n = reader.getNumberOfPages();
for (int page = 0; page < n; ) {
copy.addPage(copy.getImportedPage(reader, ++page));
}
copy.freeReader(reader);
reader.close();
is.close();
}
copy.addJavaScript(action);
// step 5
document.close();
}catch(IOException io){
throw io;
}catch(DocumentException e){
throw e;
}catch(Exception e){
throw e;
}finally{
outputStream.close();
}
I also tried with below reference but could not find print() method of PDDocument type.
Reference Link
Please guide me with this.
This is how file looks when display in browser tab:
This code reproduces what your file has, a JavaScript action in the name tree in the JavaScript entry in the name dictionary in the document catalog. ("When the document is opened, all of the actions in this name tree shall be executed, defining JavaScript functions for use by other scripts in the document" - PDF specification) There's probably an easier way to do this, e.g. with an OpenAction.
PDActionJavaScript javascript = new PDActionJavaScript("this.print(true);\n");
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDDocumentNameDictionary names = new PDDocumentNameDictionary(documentCatalog, new COSDictionary());
PDJavascriptNameTreeNode javascriptNameTreeNode = new PDJavascriptNameTreeNode();
Map<String, PDActionJavaScript> map = new HashMap<>();
map.put("0000000000000000", javascript);
javascriptNameTreeNode.setNames(map);
names.setJavascript(javascriptNameTreeNode);
document.getDocumentCatalog().setNames(names);

How do I specify fonts for a specific encoding?

I am using flying saucer library and trying to add a custom font for a specific encoding of letters. So that I could make support for unicode characters.
Here is the link of solution that I follow from official guide of flying saucer library http://flyingsaucerproject.github.io/flyingsaucer/r8/guide/users-guide-R8.html#xil_33.
Below is the code,
public void convertHtmlToPdf(String html, String css, OutputStream out) {
try {
html = correctHtml(html);
html = getFormedHTMLWithCSS(html, css);
HtmlCleaner cleaner = new HtmlCleaner();
TagNode rootTagNode = cleaner.clean(html);
CleanerProperties cleanerProperties = cleaner.getProperties();
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String cleanedHtml = xmlSerializer.getAsString(rootTagNode);
File fontFile = new File("/Verdana.ttf");
FontFactory.register(fontFile.getAbsolutePath());
ITextRenderer r = new ITextRenderer();
r.getFontResolver().addFont(fontFile.getAbsolutePath(), BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
r.setDocumentFromString(cleanedHtml);
r.layout();
r.createPDF(out);
r.finishPDF();
} catch (Exception e) {
e.printStackTrace();
logger.error(e.getMessage(), e);
}
}
But Still I am unable to encode certain characters. Like,
'■' : '■',
'▲' : '▲',
For '■' i am getting &x25a0; in generated pdf, and likewise for other characters that I try to encode.

read docx document using java

I have a project steganography to hide docx document into jpeg image. Using apache POI, I can run it and read docx document but only letters can be read.
Even though there are pictures in it.
Here is the code
FileInputStream in = null;
try
{
in = new FileInputStream(directory);
XWPFDocument datax = new XWPFDocument(in);
XWPFWordExtractor extract = new XWPFWordExtractor(datax);
String DataFinal = extract.getText();
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String line = null;
this.isi_file = extract.getText();
}
catch (IOException x) {}
System.out.println("isi :" + this.isi_file);
How can I read all component in the docx document using java? Please help me and thank you for your helping.
Please check documentation for XWPFDocument class. It contains some useful methods, for example:
getAllPictures() returns list of all pictures in document;
getTables() returns list of all tables in document.
In your code snippet exists line XWPFDocument datax = new XWPFDocument(in);. So after that line your can write some code like:
// process all pictures in document
for (XWPFPictureData picture : datax.getAllPictures()) {
// get each picture as byte array
byte[] pictureData = picture.getData();
// process picture somehow
...
}

convert html to ppt using java aspose library

i am passing html code to a variable in java. using aspose library, the html code should be executed and rendered into ppt (i am also giving the reference to css in the html).
appreciated if the ppt is editable.
Please use the following java equivalent code on your end.
public static void main(String[] args) throws Exception {
// The path to the documents directory.
String dataDir ="C:\\html\\";
// Create Empty presentation instance
Presentation pres = new Presentation();
// Access the default first slide of presentation
ISlide slide = pres.getSlides().get_Item(0);
// Adding the AutoShape to accommodate the HTML content
IAutoShape ashape = slide.getShapes().addAutoShape(ShapeType.Rectangle, 10, 10, (float) pres.getSlideSize().getSize().getWidth(), (float) pres.getSlideSize().getSize().getHeight());
ashape.getFillFormat().setFillType(FillType.NoFill);
// Adding text frame to the shape
ashape.addTextFrame("");
// Clearing all paragraphs in added text frame
ashape.getTextFrame().getParagraphs().clear();
// Loading the HTML file using InputStream
InputStream inputStream = new FileInputStream(dataDir + "file.html");
Reader reader = new InputStreamReader(inputStream);
int data = reader.read();
String content = ReadFile(dataDir + "file.html");
// Adding text from HTML stream reader in text frame
ashape.getTextFrame().getParagraphs().addFromHtml(content);
// Saving Presentation
pres.save(dataDir + "output.pptx", SaveFormat.Pptx);
}
public static String ReadFile(String FileName) throws Exception {
File file = new File(FileName);
StringBuilder contents = new StringBuilder();
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
String text = null;
// repeat until all lines is read
while ((text = reader.readLine()) != null) {
contents.append(text).append(System.getProperty("line.separator"));
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (reader != null) {
reader.close();
}
} catch (IOException e) {
e.printStackTrace();
return null;
}
}
return contents.toString();
}
#Balchandar Reddy,
I have observed your comments and like to share that ImportingHTMLTextInParagraphs.class points to path of file. I have updated the code relate to this.
Secondly, you need to call import com.aspose.slides.IAutoShape on your end to resolve the issue.
I have observed your requirements and regret to share that Aspose.Slides which is an API for managing PowerPoint slides, does not support feature for converting HTML to PPT/PPTX. However, it supports importing HTML text inside slide text frames that you may use.
// Create Empty presentation instance// Create Empty presentation instance
using (Presentation pres = new Presentation())
{
// Acesss the default first slide of presentation
ISlide slide = pres.Slides[0];
// Adding the AutoShape to accomodate the HTML content
IAutoShape ashape = slide.Shapes.AddAutoShape(ShapeType.Rectangle, 10, 10, pres.SlideSize.Size.Width - 20, pres.SlideSize.Size.Height - 10);
ashape.FillFormat.FillType = FillType.NoFill;
// Adding text frame to the shape
ashape.AddTextFrame("");
// Clearing all paragraphs in added text frame
ashape.TextFrame.Paragraphs.Clear();
// Loading the HTML file using stream reader
TextReader tr = new StreamReader(dataDir + "file.html");
// Adding text from HTML stream reader in text frame
ashape.TextFrame.Paragraphs.AddFromHtml(tr.ReadToEnd());
// Saving Presentation
pres.Save("output_out.pptx", Aspose.Slides.Export.SaveFormat.Pptx);
}
I am working as Support developer/ Evangelist at Aspose.

How to add metadata to PDF document using PDFbox?

I have an input stream of a PDF document available to me. I would like to add subject metadata to the document and then save it. I'm not sure how to do this.
I came across a sample recipe here: https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
However, it is still fuzzy. Below is what I'm trying and places where I have questions
PDDocument doc = PDDocument.load(myInputStream);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
InputStream newXMPData = ...; //what goes here? How can I add subject tag?
PDMetadata newMetadata = new PDMetadata(doc, newXMLData, false );
catalog.setMetadata( newMetadata );
//does anything else need to happen to save the document??
//I would like an outputstream of the document (with metadata) so that I can save it to an S3 bucket
The following code sets the title of a PDF document, but it should be adaptable to work with other properties as well:
public static byte[] insertTitlePdf(byte[] documentBytes, String title) {
try {
PDDocument document = PDDocument.load(documentBytes);
PDDocumentInformation info = document.getDocumentInformation();
info.setTitle(title);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);
return baos.toByteArray();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
Apache PDFBox is needed, so import it to e.g. Maven with:
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.6</version>
</dependency>
Add a title with:
byte[] documentBytesWithTitle = insertTitlePdf(documentBytes, "Some fancy title");
Display it in the browser with (JSF example):
<object class="pdf" data="data:application/pdf;base64,#{myBean.getDocumentBytesWithTitleAsBase64()}" type="application/pdf">Document could not be loaded</object>
Result (Chrome):
Another much easier way to do this would be to use the built-in Document Information object:
PDDocument inputDoc = // your doc
inputDoc.getDocumentInformation().setCreator("Some meta");
inputDoc.getDocumentInformation().setCustomMetadataValue("fieldName", "fieldValue");
This also has the benefit of not requiring the xmpbox library.
This answer uses xmpbox and comes from the AddMetadataFromDocInfo example in the source code download:
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
dc.setDescription("descr");
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(doc);
metadata.importXMPMetadata(baos.toByteArray());
doc.getDocumentCatalog().setMetadata(metadata);

Categories