I want to generate a PDF document from a "raw" email. This email could contain HTML or just plain text. I don't care about attachments.
The resulting PDF should preserve the formatting (from CSS and HTML) and include embedded images.
My first idea was to render the email using an email client like Thunderbird and then print it to PDF. Does Thunderbird offer such an API, or are there Java libraries available to print an email to PDF?
I've found a better solution than the one I posted before: save the email as HTML, use JTidy to clean it up into XHTML, and finally use the Flying Saucer HTML renderer to save it as a PDF.
Here is an example I wrote:
import com.lowagie.text.DocumentException;
import org.w3c.tidy.Tidy;
import org.xhtmlrenderer.pdf.ITextRenderer;
import java.io.*;
import java.util.*;
import javax.mail.*;
public class Email2PDF {
public static void main(String[] args) {
Properties props = new Properties();
props.setProperty("mail.store.protocol", "imaps");
try {
Session session = Session.getInstance(props, null);
Store store = session.getStore();
//read your latest email
store.connect("imap.gmail.com", "youremail@gmail.com", "password");
Folder inbox = store.getFolder("INBOX");
inbox.open(Folder.READ_ONLY);
Message msg = inbox.getMessage(inbox.getMessageCount());
Multipart mp = (Multipart) msg.getContent();
BodyPart bp = mp.getBodyPart(0);
String filename = msg.getSubject();
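// writeTo() dumps the whole MIME message (headers included); close the file before JTidy reads it
try (FileOutputStream os = new FileOutputStream(filename + ".html")) {
msg.writeTo(os);
}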
//use jtidy to clean up the html
cleanHtml(filename);
//save it into pdf
createPdf(filename);
} catch (Exception mex) {
mex.printStackTrace();
}
}
public static void cleanHtml(String filename) {
File file = new File(filename + ".html");
InputStream in = null;
try {
in = new FileInputStream(file);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
OutputStream out = null;
try {
out = new FileOutputStream(filename + ".xhtml");
} catch (FileNotFoundException e) {
e.printStackTrace();
}
final Tidy tidy = new Tidy();
tidy.setQuiet(false);
tidy.setShowWarnings(true);
tidy.setShowErrors(0);
tidy.setMakeClean(true);
tidy.setForceOutput(true);
org.w3c.dom.Document document = tidy.parseDOM(in, out);
}
public static void createPdf(String filename)
throws IOException, DocumentException {
OutputStream os = new FileOutputStream(filename + ".pdf");
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(new File(filename + ".xhtml"));
renderer.layout();
renderer.createPDF(os);
os.close();
}
}
Enjoy!
I put together a piece of software that converts EML files to PDFs by parsing (and cleaning) the MIME structure, converting it to HTML, and then using wkhtmltopdf to convert that to a PDF file.
It also handles inline images and corrupt MIME headers, and can use a proxy.
The code is available on GitHub under the Apache License 2.0.
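As a rough illustration of the final step only (handing an HTML file to wkhtmltopdf), the call from Java could be sketched as below. This is not taken from the project mentioned above; the class and method names are hypothetical, and it assumes the `wkhtmltopdf` binary is installed and on the `PATH`, using its basic `wkhtmltopdf <input> <output>` form.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that shells out to wkhtmltopdf; assumes the binary is on the PATH. */
final class WkhtmltopdfRunner {

    /** Builds the basic command line: wkhtmltopdf <input.html> <output.pdf>. */
    static List<String> command(String htmlFile, String pdfFile) {
        return Arrays.asList("wkhtmltopdf", htmlFile, pdfFile);
    }

    /** Runs the converter and returns its exit code. */
    static int convert(String htmlFile, String pdfFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(htmlFile, pdfFile))
                .inheritIO() // show the tool's own progress and error output
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: WkhtmltopdfRunner <input.html> <output.pdf>");
            return;
        }
        System.out.println("wkhtmltopdf exited with code " + convert(args[0], args[1]));
    }
}
```

Running an external renderer keeps the Java side small, at the cost of depending on a tool that has to be installed separately.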
import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.*;
import javax.mail.*;
public class Email2PDF {
public static void main(String[] args) {
Properties props = new Properties();
props.setProperty("mail.store.protocol", "imaps");
try {
Session session = Session.getInstance(props, null);
Store store = session.getStore();
store.connect("imap.gmail.com", "youremail@gmail.com", "password");
Folder inbox = store.getFolder("INBOX");
inbox.open(Folder.READ_ONLY);
Message msg = inbox.getMessage(inbox.getMessageCount());
Multipart mp = (Multipart) msg.getContent();
BodyPart bp = mp.getBodyPart(0);
createPdf(msg.getSubject(), (String) bp.getContent());
} catch (Exception mex) {
mex.printStackTrace();
}
}
public static void createPdf(String filename, String body)
throws DocumentException, IOException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(filename + ".pdf"));
document.open();
document.add(new Paragraph(body));
document.close();
}
}
I used iText as the PDF library.
You can read the HTML content using an email client and then use iText to convert it into a PDF.
Look into FPDF and FPDI, two free PHP libraries for creating PDF documents.
Since the Internet message format used with SMTP (RFC 5322) has strict rules, you can always count on the first empty line to mark the end of the headers and the start of the message content. So you can parse everything after that first empty line to get the entire message body.
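A minimal sketch of that split, using only the JDK, might look like the following. The class name is made up for illustration, and note that for multipart messages the "body" returned here still contains the MIME parts and their boundaries, so a real parser such as JavaMail is the safer choice.

```java
/** Illustrative helper: splits a raw message into header block and body at the first empty line. */
final class RawMessageSplitter {

    /** Returns {headers, body}; the body is empty if no empty line is found. */
    static String[] split(String raw) {
        // Normalize line endings so CRLF and bare LF messages are handled alike.
        String normalized = raw.replace("\r\n", "\n");
        int separator = normalized.indexOf("\n\n");
        if (separator < 0) {
            return new String[] { normalized, "" };
        }
        return new String[] {
            normalized.substring(0, separator),
            normalized.substring(separator + 2)
        };
    }

    public static void main(String[] args) {
        String raw = "From: alice@example.com\r\nSubject: Hello\r\n\r\nFirst line\r\n\r\nSecond line";
        String[] parts = split(raw);
        System.out.println("HEADERS:\n" + parts[0]);
        System.out.println("BODY:\n" + parts[1]);
    }
}
```

Only the first empty line is treated as the separator, so empty lines inside the body are preserved.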
For embedded images, you'll need a Base64 decoder (usually), or another decoder matching the part's transfer encoding, to turn the data back into a viewable image.
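For the common Base64 case, the decoding step could be sketched with the JDK's `java.util.Base64` as below. The class name and the sample input are illustrative only; the MIME decoder is used because mail bodies wrap Base64 text across lines.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

/** Illustrative helper: decodes the Base64 text of an embedded image part into raw bytes. */
final class InlineImageDecoder {

    static byte[] decode(String base64Body) {
        // The MIME decoder ignores the line breaks that mail bodies insert into Base64 text.
        return Base64.getMimeDecoder().decode(base64Body);
    }

    public static void main(String[] args) throws IOException {
        // "iVBORw0KGgo=" is the Base64 form of the 8-byte PNG signature, wrapped over two lines here.
        byte[] bytes = decode("iVBORw0K\r\nGgo=");
        Path out = Files.createTempFile("inline-image", ".bin");
        Files.write(out, bytes);
        System.out.println("Wrote " + bytes.length + " bytes to " + out);
    }
}
```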
You could try the Apache PDFBox library.
It seems to have a nice API and it also supports printing through its PrintPDF tool.
You would have to run the print command from CLI with your file as a parameter.
Edit: It is Java and open-source.
Hope it helps!
Related
I have very little experience in Java (I'm working on my first real program) and have been looking for a solution for hours. I have hacked together a small program to download PDF files from a link. It works fine for most links, but some of them just don't work.
The content type for all the links that work shows up as application/pdf, but some links report a content type of text/html for some reason.
I keep trying to rewrite the code using whatever I can find online but I keep getting the same result.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.net.ConnectException;
import java.net.URL;
import java.net.URLConnection;
public class Main {
public static void main(String[] args) throws Exception {
String link = "https://www.menards.com/main/items/media/UNITE051/SDS/SpectracideVegetationKillerReadyToUse2-228-714-8845-SDS-Feb16.pdf";
String fileName = "File Name.pdf";
URL url1 = new URL(link);
try {
URLConnection urlConn = url1.openConnection();
byte[] buffer = new byte[1024];
double downloaded = 0.00;
int read = 0;
System.out.println(urlConn.getContentType()); // This shows as text/html but it should be PDF
FileOutputStream fos1 = new FileOutputStream(fileName);
BufferedInputStream is1 = new BufferedInputStream(urlConn.getInputStream());
BufferedOutputStream bout = new BufferedOutputStream(fos1, 1024);
try {
while ((read = is1.read(buffer, 0, 1024)) >= 0) {
bout.write(buffer, 0, read);
downloaded += read;
}
bout.close();
fos1.flush();
fos1.close();
is1.close();
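} catch (Exception e) {
e.printStackTrace(); // report I/O failures instead of silently swallowing them
}
} catch (Exception e) {
e.printStackTrace();
}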
}
}
I need to be able to download the PDF from the link in the code.
This is what I see when I open the saved "PDF" file as text:
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>
The website implemented a check to make sure the request came from a browser. I copied the User-Agent from Chrome and sent it with my request, and that allowed me to download the PDF.
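Setting that header on a `URLConnection` could look roughly like this. The class name, URL, and User-Agent value are placeholders; you would copy the real string from your own browser.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

/** Illustrative helper: prepares a connection that identifies itself with a browser-like User-Agent. */
final class BrowserLikeConnection {

    static URLConnection open(String link, String userAgent) throws IOException {
        // openConnection() only creates the object; nothing is sent until the stream is read.
        URLConnection connection = new URL(link).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and User-Agent; substitute the real link and your browser's string.
        URLConnection connection = open("https://example.com/file.pdf", "Mozilla/5.0 (placeholder)");
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

The header has to be set before `getInputStream()` or `getContentType()` is called, because those trigger the actual request.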
The URL that you are fetching doesn't point to a PDF file. It points to an HTML file which embeds the PDF file. You probably need to look closely at what the URL of the actual PDF file is. Your code seems all right.
Just run cURL on the URL and see. It will most probably return an HTML file.
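One way to catch this case in the program itself is to check the first bytes of the response: PDF files begin with the marker `%PDF-`. The helper below is only a sketch with a made-up class name.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative check: PDF files start with the ASCII marker "%PDF-". */
final class PdfSniffer {

    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlStart = "<html>\n<head>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfStart));
        System.out.println(looksLikePdf(htmlStart));
    }
}
```

Checking the downloaded bytes this way avoids saving an HTML challenge page under a .pdf name.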
I am using docx4j to convert DOCX to HTML. The conversion works and I get the HTML output, which I will embed as the body of an email. But I have some issues, listed below:
Unable to display images in email body
Losing the spaces and bullets
Here is the code I have written:
WordprocessingMLPackage wordMLPackage;
wordMLPackage = Docx4J.load(new java.io.File(resourcePath2));
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setImageDirPath(imageFolder + resourcePath2 + "_files");
htmlSettings.setImageTargetUri(imageFolder +resourcePath2.substring(resourcePath2.lastIndexOf("/")+1) + "_files");
htmlSettings.setWmlPackage(wordMLPackage);
OutputStream os;
os = new ByteArrayOutputStream();
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_SAVE_FLAT_XML);
DOCX = ((ByteArrayOutputStream)os).toString();
You could add something like this to your code:
package tcg.doc.web.managedBeans;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Component;
@Component
@Scope("session")
@Qualifier("ConvertWord")
public class ConvertWord {
private static final String docName = "TestDocx.docx";
private static final String outputlFolderPath = "d:/";
String htmlNamePath = "docHtml.html";
String zipName="_tmp.zip";
File docFile = new File(outputlFolderPath+docName);
File zipFile = new File(zipName);
public void ConvertWordToHtml() {
try {
// 1) Load DOCX into XWPFDocument
InputStream doc = new FileInputStream(new File(outputlFolderPath+docName));
System.out.println("InputStream"+doc);
XWPFDocument document = new XWPFDocument(doc);
// 2) Prepare XHTML options (here we set the IURIResolver to load images from a "word/media" folder)
XHTMLOptions options = XHTMLOptions.create(); //.URIResolver(new FileURIResolver(new File("word/media")));;
// Extract image
String root = "target";
File imageFolder = new File( root + "/images/" + docName ); // use the file name, not the InputStream's toString()
options.setExtractor( new FileImageExtractor( imageFolder ) );
// URI resolver
options.URIResolver( new FileURIResolver( imageFolder ) );
OutputStream out = new FileOutputStream(new File(htmlPath()));
XHTMLConverter.getInstance().convert(document, out, options);
System.out.println("OutputStream "+out.toString());
} catch (FileNotFoundException ex) {
ex.printStackTrace(); // report the failure instead of swallowing it
} catch (IOException ex) {
ex.printStackTrace();
}
}
public static void main(String[] args) {
ConvertWord cwoWord=new ConvertWord();
cwoWord.ConvertWordToHtml();
System.out.println();
}
public String htmlPath(){
// d:/docHtml.html
return outputlFolderPath+htmlNamePath;
}
public String zipPath(){
// d:/_tmp.zip
return outputlFolderPath+zipName;
}
}
Maven dependency for pom.xml:
<dependency>
<groupId>fr.opensagres.xdocreport</groupId>
<artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
<version>1.0.4</version>
</dependency>
or download the JAR manually.
For images to work in an email body, I guess you need to use either a data URI or publish them to a web-reachable location.
In either case, you'll need to write an implementation of:
public interface ConversionImageHandler {
/**
* @param picture
* @param relationship of the image
* @param part of the image, if it is an internal image, otherwise null
* @return uri for the image we've saved, or null
* @throws Docx4JException this exception will be logged, but not propagated
*/
public String handleImage(AbstractWordXmlPicture picture, Relationship relationship, BinaryPart part) throws Docx4JException;
}
and configure docx4j to use it with htmlSettings.setImageHandler.
You can look at some of the existing implementations in the docx4j source code, and take advantage of the helper methods in AbstractConversionImageHandler (eg createEncodedImage if you want data URIs).
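For reference, the data URI itself is just a MIME type plus Base64-encoded bytes. The sketch below shows that format with the plain JDK; it is not docx4j's `createEncodedImage`, and the class name is hypothetical.

```java
import java.util.Base64;

/** Illustrative helper: builds a data URI that can be used directly in an img src attribute. */
final class DataUriBuilder {

    static String toDataUri(String mimeType, byte[] data) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // Only the 8-byte PNG signature is used here; real code would pass the full image bytes.
        byte[] pngSignature = { (byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A };
        System.out.println("<img src=\"" + toDataUri("image/png", pngSignature) + "\"/>");
    }
}
```

Keep in mind that not every email client renders data URIs, which is why publishing the images to a web-reachable location remains the alternative.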
I have successfully followed tutorials on how to embed images into HTML with JavaMail. However, I am now trying to read from a template HTML file and then embed images into it before sending.
I am sure that the code for embedding the images is right, because when I use:
bodyPart.setContent("<html><body><h2>A title</h2>Some text in here<br/>" +
"<img src=\"cid:the-img-1\"/><br/> some more text<img src=\"cid:the-img-1\"/></body></html>", "text/html");
The images display fine. However, when I read from a file using:
readHTMLToString reader = new readHTMLToString();
String str = reader.readHTML();
bodyPart.setContent(str, "text/html");
The images do not show up when the email sends.
My code for reading the html to string is as follows:
public class readHTMLToString {
static String finalFile;
public static String readHTML() throws IOException{
//initialize an InputStream
File htmlfile = new File("C:/temp/basictest.html");
System.out.println(htmlfile.exists());
try {
FileInputStream fin = new FileInputStream(htmlfile);
byte[] buffer= new byte[(int)htmlfile.length()];
new DataInputStream(fin).readFully(buffer);
fin.close();
String s = new String(buffer, "UTF-8");
finalFile = s;
}
catch(FileNotFoundException e)
{
System.out.println("File not found" + e);
}
catch(IOException ioe)
{
System.out.println("Exception while reading the file " + ioe);
}
return finalFile;
}
}
My complete class for sending the email is as follows:
package com.bcs.test;
import java.io.IOException;
import java.util.Properties;
import javax.activation.DataHandler;
import javax.activation.DataSource;
import javax.activation.FileDataSource;
import javax.mail.BodyPart;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeBodyPart;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;
public class SendEmail {
public static void main(String[] args) throws IOException {
final String username = "usernamehere@gmail.com";
final String password = "passwordhere";
Properties props = new Properties();
props.put("mail.smtp.auth", "true");
props.put("mail.smtp.starttls.enable", "true");
props.put("mail.smtp.host", "smtp.gmail.com");
props.put("mail.smtp.port", "587");
Session session = Session.getInstance(props,
new javax.mail.Authenticator() {
protected PasswordAuthentication getPasswordAuthentication() {
return new PasswordAuthentication(username, password);
}
});
try {
Message message = new MimeMessage(session);
message.setFrom(new InternetAddress("from-email@gmail.com"));
message.setRecipients(Message.RecipientType.TO,
InternetAddress.parse("recepientemailhere"));
message.setSubject("Testing Subject");
//SET MESSAGE AS HTML
MimeMultipart multipart = new MimeMultipart("related");
// Create bodypart.
BodyPart bodyPart = new MimeBodyPart();
// Create the HTML with link to image CID.
// Prefix the link with "cid:".
//bodyPart.setContent("<html><body><h2>A title</h2>Some text in here<br/>" +
// "<img src=\"cid:the-img-1\"/><br/> some more text<img src=\"cid:the-img-1\"/></body></html>", "text/html");
readHTMLToString reader = new readHTMLToString();
String str = reader.readHTML();
// Set the MIME-type to HTML.
bodyPart.setContent(str, "text/html");
// Add the HTML bodypart to the multipart.
multipart.addBodyPart(bodyPart);
// Create another bodypart to include the image attachment.
BodyPart imgPart = new MimeBodyPart();
// Read image from file system.
DataSource ds = new FileDataSource("C:\\temp\\dice.png");
imgPart.setDataHandler(new DataHandler(ds));
// Set the content-ID of the image attachment.
// Enclose the image CID with the lesser and greater signs.
imgPart.setDisposition(MimeBodyPart.INLINE);
imgPart.setHeader("Content-ID","the-img-1");
//bodyPart.setHeader("Content-ID", "<image_cid>");
// Add image attachment to multipart.
multipart.addBodyPart(imgPart);
// Add multipart content to message.
message.setContent(multipart);
//message.setText("Dear Mail Crawler,"
// + "\n\n No spam to my email, please!");
Transport.send(message);
System.out.println("Done");
} catch (MessagingException e) {
throw new RuntimeException(e);
}
}
}
I've read through numerous answers about this but I'm really not sure why this is happening. I thought it was an issue with my HTML file, so I created a very basic one with the same content as the initial setContent code above, but the pictures don't appear in this basic example either.
Is it something to do with reading into a byte array?
Any help greatly appreciated.
Thanks
The way email clients interpret HTML is different from the way a browser renders an HTML template file. One thing you could try: once you have the template, embed the image data directly in the src attribute as an inline image, since a browser normally interprets the src attribute and makes another request to fetch the data.
See "Inline Images in HTML" for more insight into the concept.
Of course you need to make sure that the data in the file is actually encoded in UTF-8 and not in the default encoding for your computer. If you test this with all ASCII text, it shouldn't matter.
Assuming you have the same text in the file that you have in the string in the sample code above, you can compare the two cases (string, file) to see how the messages JavaMail would send differ by using message.writeTo(new FileOutputStream("msg.txt")); just before or in place of the Transport.send call.
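To rule out the encoding question, the template can be read with an explicit charset. This is a minimal JDK-only sketch; the class name is made up, and it assumes the template file really is stored as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative helper: reads an HTML template as UTF-8 regardless of the platform default charset. */
final class TemplateReader {

    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small template to a temporary file, then read it back.
        Path template = Files.createTempFile("template", ".html");
        String html = "<html><body><img src=\"cid:the-img-1\"/></body></html>";
        Files.write(template, html.getBytes(StandardCharsets.UTF_8));
        System.out.println(readUtf8(template).equals(html));
    }
}
```

If the string read from the file matches the hard-coded one, the difference between the two messages has to come from somewhere other than the file reading.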
The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
* This class is used to read .doc and .docx files
*
* @author Developer
*
*/
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
class TextExtractor {
private OutputStream outputstream;
private ParseContext context;
private Detector detector;
private Parser parser;
private Metadata metadata;
private String extractedText;
public TextExtractor() {
context = new ParseContext();
detector = new DefaultDetector();
parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
outputstream = new ByteArrayOutputStream();
metadata = new Metadata();
}
public void process(String filename) throws Exception {
URL url;
File file = new File(filename);
if (file.isFile()) {
url = file.toURI().toURL();
} else {
url = new URL(filename);
}
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);
input.close();
}
public void getString() {
//Get the text into a String object
extractedText = outputstream.toString();
//Do whatever you want with this String object.
System.out.println(extractedText);
}
public static void main(String args[]) throws Exception {
if (args.length == 1) {
TextExtractor textExtractor = new TextExtractor();
textExtractor.process(args[0]);
textExtractor.getString();
} else {
throw new Exception();
}
}
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for(String rawText : extractor.getParagraphText()) {
String text = extractor.stripFields(rawText);
System.out.println(text);
}
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
Try this; it works for me and is purely a POI solution. It uses XWPFDocument, which reads .docx files (Word 2007 and later). For older .doc files (Word 97-2003) you will have to use the HWPFDocument counterpart instead.
InputStream inputstream = new FileInputStream(m_filepath);
//read the file
XWPFDocument adoc= new XWPFDocument(inputstream);
//and place it in a xwpf format
String aString = new XWPFWordExtractor(adoc).getText();
//gets the full text
Now, if you only want certain parts, you can use getParagraphText(), but don't go through the text extractor; call it directly on the paragraph, like this:
for (XWPFParagraph p : adoc.getParagraphs())
{
System.out.println(p.getParagraphText());
}
I have a web application where I need to display .eml files (in RFC 822 format) to users, formatted properly as email: show the HTML or text body properly, show images, attachments and so on. Do you know of a component or library that can do this?
I'd prefer it to be in Java (and easy to integrate with Spring :-) ), but any other implementation that runs on Apache is fine as well.
JavaMail can read EML files.
import java.util.*;
import java.io.*;
import javax.mail.*;
import javax.mail.internet.*;
public class ReadEmail {
public static void main(String args[]) throws Exception{
display(new File("C:\\temp\\message.eml"));
}
public static void display(File emlFile) throws Exception{
Properties props = System.getProperties();
props.put("mail.host", "smtp.dummydomain.com");
props.put("mail.transport.protocol", "smtp");
Session mailSession = Session.getDefaultInstance(props, null);
InputStream source = new FileInputStream(emlFile);
MimeMessage message = new MimeMessage(mailSession, source);
System.out.println("Subject : " + message.getSubject());
System.out.println("From : " + message.getFrom()[0]);
System.out.println("--------------");
System.out.println("Body : " + message.getContent());
}
}
Reference: "Handle EML file with JavaMail"
You could load the .eml file into a javax.mail.Message named mailMessage (see "Loading .eml files into javax.mail.Messages"),
then use this library to convert it into a MessageBean:
http://javaclue.blogspot.com/2009/09/portable-java-mail-message-bean_02.html
MessageBean mb = MessageBeanUtil.mimeToBean(mailMessage);