docx4j conversion html->docx->html


I'm working on my first project using docx4j... My goal is to export xhtml from a webapp (ckeditor created html) into a docx, edit it in Word, then import it back into the ckeditor wysiwyg.
(*crosspost from http://www.docx4java.org/forums/xhtml-import-f28/html-docx-html-inserts-a-lot-of-space-t1966.html#p6791?sid=78b64a02482926c4dbdbafbf50d0a914
will update when answered)
I have created an html test document with the following contents:
<html><ul><li>TEST LINE 1</li><li>TEST LINE 2</li></ul></html>
My code then creates a docx from this html like so:
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.createPackage();
NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
ndp.unmarshalDefaultNumbering();
XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
xHTMLImporter.setHyperlinkStyle("Hyperlink");
wordMLPackage.getMainDocumentPart().getContent()
.addAll(xHTMLImporter.convert(new File("test.html"), null));
System.out.println(XmlUtils.marshaltoString(wordMLPackage
.getMainDocumentPart().getJaxbElement(), true, true));
wordMLPackage.save(new java.io.File("test.docx"));
My code then attempts to convert the docx BACK to html like so:
// Load the docx saved above; the XHTML importer is not needed for export
WordprocessingMLPackage docx = WordprocessingMLPackage.load(new File("test.docx"));
AbstractHtmlExporter exporter = new HtmlExporterNG2();
HTMLSettings htmlSettings = new HTMLSettings();
// try-with-resources closes the output stream once the export is done
try (OutputStream os = new java.io.FileOutputStream("test.html")) {
    javax.xml.transform.stream.StreamResult result = new javax.xml.transform.stream.StreamResult(os);
    exporter.html(docx, result, htmlSettings);
}
The html returned is:
<?xml version="1.0" encoding="UTF-8"?><html xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<style>
<!--/*paged media */ div.header {display: none }div.footer {display: none } /*@media print { */@page { size: A4; margin: 10%; @top-center {content: element(header) } @bottom-center {content: element(footer) } }/*element styles*/ .del {text-decoration:line-through;color:red;} .ins {text-decoration:none;background:#c0ffc0;padding:1px;}
/* TABLE STYLES */
/* PARAGRAPH STYLES */
.DocDefaults {display:block;margin-bottom: 4mm;line-height: 115%;font-size: 11.0pt;}
.Normal {display:block;}
/* CHARACTER STYLES */ span.DefaultParagraphFont {display:inline;}
-->
</style>
<script type="text/javascript">
<!--function toggleDiv(divid){if(document.getElementById(divid).style.display == 'none'){document.getElementById(divid).style.display = 'block';}else{document.getElementById(divid).style.display = 'none';}}
--></script>
</head>
<body>
<!-- userBodyTop goes here -->
<div class="document">
<p class="Normal DocDefaults " style="text-align: left;position: relative; margin-left: 17mm;text-indent: -0.25in;margin-bottom: 0in;">• <span class="DefaultParagraphFont " style="font-weight: normal;color: #000000;font-style: normal;font-size: 11.0pt;">TEST LINE 1</span>
</p>
<p class="Normal DocDefaults " style="text-align: left;position: relative; margin-left: 17mm;text-indent: -0.25in;margin-bottom: 0in;">• <span class="DefaultParagraphFont " style="font-weight: normal;color: #000000;font-style: normal;font-size: 11.0pt;">TEST LINE 2</span>
</p>
</div>
<!-- userBodyTail goes here -->
</body>
</html>
There is now a lot of extra space after each line. I'm not sure why this is happening; the conversion appears to add a lot of extra white space/carriage returns.

It's not clear from your question whether you are worried about whitespace in the (X)HTML source document, or in your page as rendered (presumably in CKEditor). If the latter, then the browser and CKEditor version may be relevant.
Whitespace may or may not be significant; try Googling 'xhtml significant whitespace' for more.
By way of background, depending on docx4j property docx4j.Convert.Out.HTML.OutputMethodXML, docx4j will use
<xsl:output method="html" encoding="utf-8" omit-xml-declaration="no" indent="no"
doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
or
<xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no" indent="no"
doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
Note the difference in the value of @method. If you want something different, you can modify docx2html.xsl or docx2xhtml.xsl respectively.
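To see what the two output methods change, here is a minimal sketch using only the JDK's standard JAXP identity transformer. It does not call docx4j or its stylesheets; the class name OutputMethodDemo and the helper serialize are made up for illustration. It sets the same method and omit-xml-declaration properties that the xsl:output elements above declare.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of docx4j.
class OutputMethodDemo {

    // Re-serializes a well-formed document with the given output method ("html" or "xml").
    static String serialize(String xhtml, String method, boolean omitXmlDeclaration) {
        try {
            // newTransformer() with no stylesheet gives an identity transform
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, method);
            identity.setOutputProperty(OutputKeys.INDENT, "no");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,
                    omitXmlDeclaration ? "yes" : "no");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        String doc = "<html><body><p>TEST LINE 1<br/>TEST LINE 2</p></body></html>";
        System.out.println(serialize(doc, "html", true)); // HTML rules for empty elements
        System.out.println(serialize(doc, "xml", true));  // XML rules, no declaration
        System.out.println(serialize(doc, "xml", false)); // XML rules, with declaration
    }
}
```

The html method writes empty elements such as br without a closing slash, while the xml method keeps them self-closed. Setting omit-xml-declaration to yes drops the leading XML declaration line.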

Is there a way to convert wordMLPackage to html without all the extra stuff like:
<?xml version="1.0" encoding="UTF-8"?>
and the css?
Could it be something as simple as the original HTML with inline CSS, like <html><body><div style="...."></div></body></html>?

Related

How can I set newline behavior for htmleditorkit

I have a JEditorPane set to edit HTML input, and I would like to change the newline behavior so that going to a new line inserts <br> instead of wrapping the text in <p></p>. At the moment I have the following:
newSignatureScrollPane = new javax.swing.JScrollPane();
newSignatureEditorPane = new javax.swing.JEditorPane();
newSignatureEditorPane.setContentType("text/html"); // NOI18N
newSignatureEditorPane.setDocument(new HTMLDocument());
newSignatureEditorPane.setEditorKit(new HTMLEditorKit());
newSignatureScrollPane.setViewportView(newSignatureEditorPane);
This results in the following when I do a newSignatureEditorPane.getText() in my saveChangesButtonAction:
<html>
<head>
</head>
<body>
<p style="margin-top: 0">
Line 1
</p>
<p style="margin-top: 0">
Line 2
</p>
</body>
</html>

How to pass a variable from HTML to Java?

I want to pass a variable from HTML to Java. For this, I wrote the following code:
<!doctype html>
<html>
<title>How to create a typewriter or typing effect with jQuery</title>
<div id="example1">fsdfsdfojsdlk sdfj lskdhfk sdf </div>
<style>
body{
background: transparent;
color: #ec5a62;
}
#container{
font-size: 7em;
}
</style>
</head>
<body>
<div id="container"></div>
<!--
We use Google's CDN to serve the jQuery js libs.
To speed up the page load we put these scripts at the bottom of the page
-->
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
<script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.10.3/jquery-ui.min.js"></script>
<script>
//define text
var text = ("document.getElementById("example1")");
//text is split up to letters
$.each(text.split(''), function(i, letter){
//we add 100*i ms delay to each letter
setTimeout(function(){
//we add the letter to the container
$('#container').html($('#container').html() + letter);
}, 30*i);
});
</script>
</body>
</html>
But it is not working. How can I achieve this?
I'm using var text = ("document.getElementById("example1")"); but it's not working.

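Remove the quotes around the call so that it is executed instead of being treated as a string. Also note that a div has no value property (that is for form fields such as input); read its text with textContent instead: var x = document.getElementById("example1").textContent;
Your code should look like this:
var text = document.getElementById("example1").textContent;
//text is split up to letters
$.each(text.split(''), function(i, letter){
//we add 30*i ms delay to each letter
setTimeout(function(){
//we add the letter to the container
$('#container').html($('#container').html() + letter);
}, 30*i);
});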

How to get the mail-content without the whole source code?

I am reading mail with JavaMail (javax.mail) and want to save the content of a message.
For example, I read a mail whose content is simply By: Test.
I read the content with the .getContent() method:
Object body = message.getContent();
String content = ((body instanceof String) ? (String) body : "NO STRING CONTENT");
The problem is that instead of the simple content By: Test, I get the whole Outlook HTML source of the message:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.E-MailFormatvorlage17
{mso-style-type:personal-compose;
font-family:"Arial","sans-serif";
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 2.0cm 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="DE-CH" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;">By: Test<o:p></o:p></span></p>
</div>
</body>
</html>
So how can I read the mail content without getting the whole HTML source of the message?

First, I would extract the content of the <body> section of the String. What you do next is up to you: you could, for example, remove every HTML tag, but beware that all formatting (including line breaks) is then gone and you are left with one big chunk of text.
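A minimal sketch of that idea, using only java.util.regex. The class name BodyTextSketch and the method bodyText are made up for illustration. Regex-based tag stripping is crude: it assumes no ">" appears inside attribute values and it does not decode entities such as &amp;nbsp;, so treat it as a starting point rather than a robust HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; a crude approach, not a full HTML parser.
class BodyTextSketch {

    // Captures everything between <body ...> and </body>, across line breaks
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String bodyText(String html) {
        Matcher m = BODY.matcher(html);
        // Fall back to the whole input when there is no <body> element
        String body = m.find() ? m.group(1) : html;
        return body
                .replaceAll("(?s)<!--.*?-->", " ") // drop comments, including Outlook conditionals
                .replaceAll("<[^>]+>", " ")        // drop every remaining tag
                .replaceAll("\\s+", " ")           // collapse runs of whitespace (line breaks too)
                .trim();
    }

    public static void main(String[] args) {
        String mail = "<html><head><style>p{margin:0}</style></head>"
                + "<body lang=\"DE-CH\"><div class=\"WordSection1\"><p class=\"MsoNormal\">"
                + "<span>By: Test<o:p></o:p></span></p></div></body></html>";
        System.out.println(bodyText(mail));
    }
}
```

Because the style block lives in the head, restricting the match to the body avoids having to strip the CSS separately.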

I just remembered a simpler and better way: take the text/plain part of the email.
String content = getPlainText((Part) message);

// Returns the first text/plain part found, or null if the message has none
private String getPlainText(Part p) throws MessagingException, IOException {
    if (p.isMimeType("text/plain")) {
        return (String) p.getContent();
    } else if (p.isMimeType("multipart/*")) {
        // Search the nested parts recursively
        Multipart mp = (Multipart) p.getContent();
        for (int i = 0; i < mp.getCount(); i++) {
            String s = getPlainText(mp.getBodyPart(i));
            if (s != null) return s;
        }
    }
    return null;
}

Flying Saucer - html entities are not rendered

I'm generating a PDF using the Flying Saucer library, but I have a problem with some HTML entities.
I have already searched for a solution and found many tips on this forum and elsewhere, but the problem remains.
I've tried this approach:
http://sdtidbits.blogspot.com/2008/11/flying-saucer-xhtml-rendering-and-local.html
but without any success.
My code looks like this:
os = new FileOutputStream(pdf);
ITextRenderer renderer = new ITextRenderer();
ChainingReplacedElementFactory chainingReplacedElementFactory = new ChainingReplacedElementFactory();
chainingReplacedElementFactory.addReplacedElementFactory(new B64ImgReplacedElementFactory(renderer.getSharedContext()));
renderer.getSharedContext().setReplacedElementFactory(chainingReplacedElementFactory);
renderer.setDocument(url);
renderer.layout();
renderer.createPDF(os);
where pdf is the name of the new PDF to create and url is built like this:
File f = new File(url);
if (f.exists()) {
url = f.toURI().toURL().toString();
}
My HTML file looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<style type="text/css">
html, body, div, span, applet, object, iframe, h1, h2, h3, h4, h5, h6, p, blockquote, pre, a, abbr, acronym, address, big, cite, code, del, dfn, em, font, img, ins, kbd, q, s, samp, small, strike, strong, sub, sup, tt, var, b, u, i, center, dl, dt, dd, ol, ul, li, fieldset, form, label, legend, caption, tbody, tfoot, thead, tr, th
{
color: #444;
font-family: Arial;
font-size: 14px;
line-height: 25px;
border: none;
}
table, td {border: solid 1px #CCC;}
img {page-break-inside: avoid;}
</style>
<title></title>
</head>
<body>
<h1>Test</h1>
<p>Html etites to test</p>
<p>←</p>
<p>←</p>
<p>↑</p>
<p>↑</p>
<p>↓</p>
<p></p>
</body>
</html>
Everything works fine except for those entities: nothing is rendered, only blank spots where the arrows should be.
Does anyone have a solution for that?

The issue is that the font used by iText by default doesn't support the characters you want to print.
The solution is to embed another font that can display these characters, for example DejaVu.
In the Java file, register the font with the renderer:
ITextRenderer renderer = new ITextRenderer();
renderer.getFontResolver().addFont("font/DEJAVUSANS.TTF", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
renderer.setDocument(url);
renderer.layout();
renderer.createPDF(os);
And change the font-family declaration in the HTML:
body
{
font-family: DejaVu Sans;
}
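To find out which characters in a document are likely to need such a font, a small check like the one below can help. It is a sketch only: the class GlyphAudit and its method are made up for illustration, and the cut-off (anything above U+00FF) is an assumed heuristic for "probably not covered by a default Latin font", not a rule taken from Flying Saucer or iText.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class; the U+00FF threshold is an assumption, not a library rule.
class GlyphAudit {

    // Lists the distinct code points above U+00FF found in the text, e.g. "U+2190" for a left arrow
    static Set<String> nonLatin1CodePoints(String text) {
        Set<String> found = new TreeSet<>();
        text.codePoints()
            .filter(cp -> cp > 0xFF)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // The arrows from the test page, written as Unicode escapes
        String xhtml = "<p>\u2190</p><p>\u2191</p><p>\u2193</p>";
        System.out.println(nonLatin1CodePoints(xhtml));
    }
}
```

Any code point this reports has to be present in the embedded font (DejaVu Sans in the answer above), otherwise the PDF will still show blank spots.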

JTextPane getText() return html tag in different order

I am using JTextPane to store some HTML text:
private static final String HTML_STR = "<html><div>plot(<b><font color=#3775B9>X</font></b>,Y)</div><div>plot(<b><font color=#3775B9>X</font></b>,Y,LineSpec)</div></html>";
JTextPane textPane = new JTextPane();
textPane.setContentType("text/html");
textPane.setText(HTML_STR);
After that, calls to textPane.getText() occasionally return the HTML tags in a different order. For example,
sometimes <b> is inside <font>:
<html>
<head>
</head>
<body>
<div>
plot(<font color="#3775B9"><b>X</b></font>,Y)
</div>
<div>
plot(<font color="#3775B9"><b>X</b></font>,Y,LineSpec)
</div>
</body>
</html>
Other times, <font> is inside <b>:
<html>
<head>
</head>
<body>
<div>
plot(<b><font color="#3775B9">X</font></b>,Y)
</div>
<div>
plot(<b><font color="#3775B9">X</font></b>,Y,LineSpec)
</div>
</body>
</html>
Can anybody explain why JTextPane behaves like this? Is there any way to make JTextPane return the tags in the same order consistently?
Thanks!
