How to determine the encoding of a ByteArrayOutputStream?

I need to convert a ByteArrayOutputStream to a String, but I can't figure out the encoding. Please help. I tried the ICU4J library, but it only works with an input stream. A conversion from byte array to input stream would also be fine.
Here's a sample of what I'm getting using the default encoding. Clearly the new lines are not supposed to be there.
<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n
<html>
\n
<head>
\n
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">
\n
<style type=\"text/css\">\n .style_0 { font-family: sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 10pt; color: black; text-indent: 0em; letter-spacing: normal; word-spacing: normal; text-transform: none; white-space: normal; line-height: normal;}\n .style_1 { height: 5.062in; width: 8.01in;}\n </style>
\n <script type=\"text/javascript\">\n //<![CDATA[\n function redirect(target, url){\n if (target =='_blank'){\n open(url);\n }\n else if (target == '_top'){\n window.top.location.href=url;\n }\n else if (target == '_parent'){\n location.href=url;\n }\n else if (target == '_self'){\n location.href =url;\n }\n else{\n open(url);\n }\n }\n //]]>\n </script>\n
</head>
\n <body class=\"style_0\" style=\" margin:0px;\">\n <table cellpadding=\"0\" style=\"empty-cells: show; border-collapse:collapse; width:8in; overflow: hidden; table-layout:fixed;\">\n
<col>
</col>\n
<tr>
\n
<td></td>
\n
</tr>
\n
<tr>
\n
<td valign=\"top\"></td>
\n
</tr>
\n
<tr>
\n
<td>
\n
<div style=\"overflow:hidden; height:0.5in\">\n <div style=\" overflow:hidden;\">Dec 23, 2013, 7:11 PM</div>
\n </div>\n
</td>
\n
</tr>
\n </table>\n
<hr style=\"color:red\"/>
\n
<div style=\"color:red\">
\n
<div>The following items have errors:\n </div>
\n <br>\n
<div>
\n
<div id=\"error_title\" style=\"text-decoration:underline\">
Chart (id = 12):
\n

I tried the ICU4J library, but it only works with an input stream.
You can get the byte array from the ByteArrayOutputStream with toByteArray(), wrap it in a ByteArrayInputStream, and pass that to the ICU4J method.
(Bear in mind that ICU4J may guess the wrong encoding, or that the bytes might not represent text in any known encoding.)
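The wrapping step can be sketched with the standard library alone. The class and method names below (StreamBytes, asInputStream, decodeUtf8OrNull) are made up for illustration, and the strict UTF-8 check is only a cheap first test, not a replacement for a real detector:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StreamBytes {

    // Copies the buffered bytes and exposes them as an InputStream.
    public static InputStream asInputStream(ByteArrayOutputStream out) {
        return new ByteArrayInputStream(out.toByteArray());
    }

    // Strictly decodes the bytes as UTF-8; returns null if they are not valid UTF-8.
    public static String decodeUtf8OrNull(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write("<html>\u20B9</html>".getBytes(StandardCharsets.UTF_8));

        // This stream is what you would hand to a detector that wants an InputStream.
        InputStream in = asInputStream(out);
        System.out.println(in.available());

        // If the bytes decode cleanly as UTF-8, that is a strong hint about the encoding.
        System.out.println(decodeUtf8OrNull(out.toByteArray()));
    }
}
```

If decodeUtf8OrNull returns null, the bytes are definitely not UTF-8 and a detector (or knowledge of where the bytes came from) is needed.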

This won't help much, but java.nio.charset.CharsetDecoder has a detectedCharset() method meant to identify the charset of encoded bytes automatically. Unfortunately, the decoders you normally get in Java SE 7 by calling Charset.newDecoder() (for UTF-8, for example) are not auto-detecting, so calling detectedCharset() throws UnsupportedOperationException.
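A small check makes this visible. DetectDemo and detectionUnsupported are hypothetical names used only for this sketch:

```java
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DetectDemo {

    // Returns true if the decoder does not support charset detection at all.
    public static boolean detectionUnsupported(CharsetDecoder decoder) {
        try {
            decoder.detectedCharset();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        } catch (IllegalStateException e) {
            // An auto-detecting decoder that has not seen enough input yet.
            return false;
        }
    }

    public static void main(String[] args) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        System.out.println(decoder.isAutoDetecting());
        System.out.println(detectionUnsupported(decoder));
    }
}
```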

Related

Simple way to display currency symbol in html2pdf for iText 7

I updated my code from iText 5.0 to iText 7 and html2pdf 2.0 according to this post. In the earlier version the rupee symbol rendered properly, but because of a CSS issue I changed the code. Now the complete page converts to PDF properly except for the rupee symbol.
I tried setting the font in the HTML style tag itself, like * { font-family: Arial; }.
I changed the rupee symbol to &#x20b9;, to ₹, and also added ₹ directly, but to no avail.
My Html:
<html>
<head>
<style>
* { font-family: Arial; }
</style>
<title>HTML div</title>
</head>
<body>
<p style="margin-bottom: 0in; padding-left: 60px;">
<div style="font-size: 450%; text-indent: 150px;">
<strong>BUY <span style="color: #ff420e;">2</span> GET
</strong>
</div>
</p>
<div
style="float: left; display: inline-block; margin: 10px; text-align: right; font-size: 70%; line-height: 27; transform: rotate(270deg);">Offer
Expiry Date : 30/11/2017</Div>
<div
style="float: left; display: inline-block; margin: 10px; text-align: right; font-size: 350%;">
₹
<!-- ₹ -->
</div>
<div
style="float: left; display: inline-block; margin: auto; font-size: 1500%; color: red; font-weight: bold;">99</div>
<div
style="float: left; display: inline-block; margin: 10px; text-align: left; font-size: 250%; line-height: 10;">OFF</div>
<div
style="position: absolute; height: 40px; font-size: 250%; line-height: 600px; color: red; text-indent: 50px">Pepsi
2.25 Pet Bottle ltr</div>
<div
style="position: absolute; height: 40px; font-size: 245%; line-height: 694px; text-indent: 50px">
MRP: ₹ <span style="color: #ff420e;">654</span>
</div>
</body>
</html>
Java Code :
import com.itextpdf.html2pdf.HtmlConverter;
import java.io.File;
import java.io.IOException;

public class Test {
    final static String DEST = "D://Workspace_1574973//POP//sample_12.pdf";
    final static String SRC = "D://Workspace_1574973//POP//src//com//resources//test.html";

    public static void main(String[] args) throws Exception {
        createPdf(SRC, DEST);
    }

    // Converts the HTML file at src into a PDF file at dest.
    public static void createPdf(String src, String dest) throws IOException {
        HtmlConverter.convertToPdf(new File(src), new File(dest));
    }
}
Earlier code, which rendered the symbols correctly:
log.info("Creating file start");
OutputStream file = new FileOutputStream(new File("font_check.pdf"));
Document document = new Document(PageSize.A4);
PdfWriter writer = PdfWriter.getInstance(document, file);
document.open();
InputStream is = new ByteArrayInputStream(fileTemplate.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
document.close();
file.close();
log.info("Creating file end");
Is there a simple approach to achieve this with minimal, optimized code?
I have to generate thousands of PDFs in one go, so performance must not suffer.
Please let me know if anyone has achieved this with the latest version.
Edit: Also, how do I set a particular paper size here, such as A6, A3, or A4?

Hope you are not mad: I don't have the reputation to write simple comments, so I'll post a full answer instead. I parse HTML for my work, and I read SO sometimes. There is a lot on the subject of UTF-8 here. Most software systems support characters beyond code point 255 (Unicode, usually encoded as UTF-8), for instance the Indian rupee sign. However, most of the time the programmer has to request that behavior explicitly.
In HTML, for instance, adding this line usually helps:
String UTF8MetaTag = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />";
Anyway, not having used HTMLToPDF, I might not be the right guy to answer your question. But because I have dealt with UTF-8 foreign-language characters for three years, I know that configuring software to handle the 65,000 or so characters is usually VERY EASY, BUT ALSO ALWAYS VERY MANDATORY.
Here is an SO post about using HTMLToPDF and UTF-8 to handle Japanese kanji characters. Most likely the same approach handles all of UTF-8, but that is not a guarantee.
HTML2PDF support for japanese language(utf8) is not working
Here are a few posts about it using HTML2PDF in PHP:
Converting html 2 pdf (php) using hebrew returns "???"
Having æøå chars in HTML2PDF charset
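To see why the explicit charset matters for the rupee sign in particular, here is a minimal standard-library sketch (RupeeBytes and roundTrip are names invented for the example). It may also be relevant to the earlier code in the question, where fileTemplate.getBytes() is called without a charset and therefore uses the platform default:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RupeeBytes {

    // Encodes text with the given charset and decodes it back with the same one.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String rupee = "\u20B9"; // INDIAN RUPEE SIGN

        // UTF-8 needs three bytes for this character and round-trips it unchanged.
        System.out.println(rupee.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(roundTrip(rupee, StandardCharsets.UTF_8).equals(rupee));

        // ISO-8859-1 cannot represent it, so the character is replaced on encoding.
        System.out.println(roundTrip(rupee, StandardCharsets.ISO_8859_1));
    }
}
```

Whether this is the actual cause in html2pdf is not something the question confirms; a font without a glyph for U+20B9 would produce the same symptom.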

number cannot be wrapped in flyingsaucer and itext

Environment: flyingsaucer r8 and itext 2.0.8.
I'm creating a PDF file with flyingsaucer and itext. I added
table-layout:fixed;word-wrap:break-word;
to wrap the cell content. However, the generated PDF looks like this:
From the above diagram, we can see that the overlong English sentence in the 'Description' column wraps correctly, but the overlong numbers in 'Account Code' and 'Description' do not wrap.
I also tried "word-break: break-all;", but it still doesn't work.
My XHTML file is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<style type="text/css">
@page {-fs-flow-top: 'header';-fs-flow-bottom: 'footer';size:210mm 297mm;margin-top:40pt;@top-center {content: element(header)}
@bottom-center {content: element(footer)}page-break-before:always}#pagenumber:before { content: counter(page); }
#pagecount:before { content: counter(pages); }.pageNext{page-break-before: always;}
#header {position: running(header);font-style: italic;font-family: Arial Unicode MS;
-fs-move-to-flow: 'header';width: 100px; padding-top:10pt;}#footer {position: running(footer);font-style: italic;
font-family: Arial Unicode MS;-fs-move-to-flow: 'footer';color: #6c6ce8;}
body{font-size:13px;font-family:Arial Unicode MS;white-space:inherit;}b{font-size:13px;font-weight:bold;font-family:Arial;}
#dt th{text-align: center;font-family:Arial;font-weight:bold;}#title {font-size:15px;font-weight:bold;}#lab {font-size:15px;font-weight:bold;}
table.data{ border-top: 1px solid #333;border-bottom: 1px solid #333;width:100%;border:1px solid #333;}table{ border-collapse:collapse;}
table td{ padding:0 0 0 0; vertical-align:top;white-space:inherit;}
table.uReportStandard > thead > tr > th{ border:0.5pt #333 solid; background:#d5d0c3;color:#000;text-align:center;font-size:15px;
font-family:Arial,sans-serif;font-weight:bold;}
table.uReportStandard > tbody > tr > td{ padding:1px 1px; font-size:13px;}.data td.left_text{ font-size:13px;
font-family:Arial Unicode MS,sans-serif;width:300px;}.data td.right_text{ text-align:right;font-size:13px;
font-family:Arial Unicode MS,sans-serif;width:120px;}table#uPageCols td#uRightCol,table#uPageCols td#uRightCol aside{width:0;}
table.uReportStandard{border:0.5px #333 solid;}
</style>
<meta http-equiv='content-type' content='text/html; charset=UTF-8' />
<title></title>
</head>
<body>
<div id='footer' style='text-align:center;margin-top:0;'>
Page <span id='pagenumber' /> of <span id='pagecount' /><span style='margin-left:150px;'>2016-05-23 16:03:07</span>
</div>
<div>
<table border='0' id='dt' style='width:100%;table-layout:fixed;word-wrap:break-word;'>
<thead>
<tr style='background-color: gainsboro;border:solid 1px #333;'>
<th style='border:solid 1px #333;width:10%;'>Account Code</th>
<th style='border:solid 1px #333;width:29%;'>Bank Name</th>
<th style='border:solid 1px #333;width:35%;'>Description</th>
<th style='border:solid 1px #333;width:13%;'>Load(CNY)</th>
<th style='border:solid 1px #333;width:13%;'>Borrow(CNY)</th>
</tr>
</thead>
<tr>
<td style='border:solid 1px #333;'>66020901039</td>
<td style='border:solid 1px #333;'>ABC DEF<br />
-Global Logistics LTD</td>
<td style='border:solid 1px #333;white-space:inherit;'>break-all Behaves the same as normal for Asian text, yet allows the line to break arbitrarily for non-Asian text. This value is suited to Asian text that contains some excerpts of non-Asian text. Debit: EUR 50.00</td>
<td style='text-align: right;border:solid 1px #333;'>367.47</td>
<td style='text-align: right;border:solid 1px #333;'>
</td>
</tr>
<tr>
<td style='border:solid 1px #333;'>220201</td>
<td style='border:solid 1px #333;'>ACCOUNT PAYABLE<br />
-Global Logistics LTD</td>
<td style='border:solid 1px #333;white-space:inherit;'>88888888888888888889999999999999999997777777777777 Credit: EUR 284.36</td>
<td style='text-align: right;border:solid 1px #333;'>
</td>
<td style='text-align: right;border:solid 1px #333;'>2,089.85</td>
</tr>
</table>
</div>
</body>
</html>
My code is:
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(new File("test.xhtml"));
renderer.layout();
FileOutputStream fos = new FileOutputStream("a.pdf");
renderer.createPDF(fos);
fos.close();
My questions are:
How do table-layout:fixed;word-wrap:break-word; wrap content? What is the wrapping based on?
Why can English/Chinese sentences be wrapped correctly, while numbers cannot?
How can I wrap the numbers in my case?
Thanks in advance!
After switching to itext 5.5.9, numbers wrap correctly, but now the CSS -fs-table-paginate: paginate; #pagenumber:before { content: counter(page); }#pagecount:before { content: counter(pages); } doesn't work.

As @Lonzak said, I just replaced core-renderer-r8.jar with flying-saucer-core.9.0.9.jar and flying-saucer-pdf.9.0.9.jar and used itext 2.1.7. Then it works with the CSS:
table-layout:fixed;word-wrap:break-word
and long unbroken strings or numbers wrap correctly.

I was racking my brain over this, and I would recommend the following solution.
If you are using Thymeleaf, try the #strings.abbreviate method:
${#strings.abbreviate(exampleOfText, 10)}
source

Modifying HTML using java

I am trying to read an HTML file and add a link to some of its text.
For example, I want to add a link to the "Campaign0" text:
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">101</span></p></td>
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">Campaign0</span>
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">unknown</span></p></td>
Link to be added:
<a href="Second.html">
I need a Java program that modifies the HTML to add a hyperlink over "Campaign0".
How do I do this with Jsoup?
I tried this with Jsoup:
File input = new File("D://First.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
Element span = doc.select("span").first(); // this only gets the first span tag :(
span.wrap("<a href=\"Second.html\"></a>");
Is this correct? It's not working :(
In short, is there anything like:
if find <span>Campaign0</span>
then replace by <a href="Second.html"><span>Campaign0</span></a>
using Jsoup or any other technology inside Java code?

Your code seems pretty much correct. To find the span elements with "Campaign0", "Campaign1", etc., you can use the JSoup selector "span:containsOwn(Campaign0)". See additional documentation for JSoup selectors at jsoup.org.
After finding the elements and wrapping them with the link, calling doc.html() should return the modified HTML code. Here's a working sample:
input.html:
<table>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign0</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign1</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
</table>
Code:
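import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
// Wrap each matching span in a link; Second.html is the target named in the question.
Element span = doc.select("span:containsOwn(Campaign0)").first();
span.wrap("<a href=\"Second.html\"></a>");
span = doc.select("span:containsOwn(Campaign1)").first();
span.wrap("<a href=\"Second.html\"></a>");
String html = doc.html();
// Write the modified document back out as UTF-8.
BufferedWriter htmlWriter =
new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.html"), "UTF-8"));
htmlWriter.write(html);
htmlWriter.close();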
output:
<html>
<head></head>
<body>
<table>
<tbody>
<tr>
<td><p><span>101</span></p></td>
<td><p><a href="Second.html"><span>Campaign0</span></a></p></td>
<td><p><span>unknown</span></p></td>
</tr>
<tr>
<td><p><span>101</span></p></td>
<td><p><a href="Second.html"><span>Campaign1</span></a></p></td>
<td><p><span>unknown</span></p></td>
</tr>
</tbody>
</table>
</body>
</html>

Replace a substring with a StringBuffer substring

I have a huge string containing a complete HTML document, obtained through Jsoup. I changed a substring of the HTML using the StringBuffer replace API (replace(int startIndex, int endIndex, "replacement string")). The StringBuffer is populated perfectly, but when I try to replace the substring of the HTML with the new StringBuffer content, it does not work.
Here is the code snippet:
html = html.replace(divStyle1.trim(), heightwidthM.toString().trim());
The initial big html is
<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
<head>
</head>
<body>
**<div style="background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;">**
<div style="height:2058px; padding-left:0px; padding-top:36px;">
<iframe style="height:90px; width:728px;"/>
</div>
</div>
</body>
</html>
The divStyle1 string is
background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;
And the String buffer has value
background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height:720px; width:900px; text-align: center; margin: 0 auto;
The replace call above does not work. Here divStyle1 is a substring of the HTML above (as a String), and heightwidthM is the StringBuffer whose value should replace it. The call doesn't throw any errors, but it doesn't change anything either.
Thanks
Swaraj

This is very easy with JSoup
String html = "<!DOCTYPE html>\n<html xmlns:og=\"http://opengraphprotocol.org/schema/\" xmlns:fb=\"http://www.facebook.com/2008/fbml\" xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\" class=\"SAF\" id=\"global-header-light\">\n<head>\n\n</head>\n<body>\n\n\n**<div style=\"background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;\">** \n\n<div style=\"height:2058px; padding-left:0px; padding-top:36px;\">\n\n\n<iframe style=\"height:90px; width:728px;\"/>\n\n\n\n</div>\n</div>\n\n</body>\n</html>";
String newStyle = "background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height:720px; width:900px; text-align: center; margin: 0 auto;";
Document document = Jsoup.parse(html);
document.body().child(0).attr("style", newStyle);
System.out.println(document.html());

Coming back to my suggestion, if you don't mind trying, you can do something of this sort:
Document newDocument = Jsoup.parse(<your html string>, StringUtils.EMPTY, Parser.htmlParser());
Elements yourStyles = newDocument.select("div[style]"); // this will select all div with attributes style
yourStyles.get(0).attr("style", <your new value>); // this will get your first div and replace attribute style to your new value
System.out.println(newDocument.outerHtml());
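If you would rather stay with plain string replacement, it may help to check first whether the target occurs verbatim. String.replace matches literally and returns the original string unchanged when nothing matches, so a single differing space is enough to get the silent no-op described in the question. That this is the actual cause here is an assumption; ReplaceCheck and replaceOrFail are hypothetical names for the sketch:

```java
public class ReplaceCheck {

    // Replaces target with replacement, or throws if target does not occur verbatim.
    public static String replaceOrFail(String html, String target, String replacement) {
        if (!html.contains(target)) {
            throw new IllegalArgumentException("target not found verbatim in html");
        }
        return html.replace(target, replacement);
    }

    public static void main(String[] args) {
        String html = "<div style=\"height: 2059px; width: 1001px;\"></div>";
        StringBuffer newStyle = new StringBuffer("height:720px; width:900px;");

        // Exact match: the replacement is applied.
        System.out.println(replaceOrFail(html, "height: 2059px; width: 1001px;", newStyle.toString()));

        // One missing space and String.replace quietly returns the input unchanged.
        String unchanged = html.replace("height:2059px; width: 1001px;", newStyle.toString());
        System.out.println(unchanged.equals(html));
    }
}
```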

Creating CSS from a HTML file

I have an html file which contains many elements:
<div>
<div id="imgElt11289447233738dIi15v" style="BORDER-RIGHT: 0px; BORDER-TOP: 0px; Z-INDEX: 1; LEFT: 795px; BORDER-LEFT: 0px; WIDTH: 90px; CURSOR: auto; BORDER-BOTTOM: 0px; POSITION: absolute; TOP: 186px; HEIGHT: 93px" lineid="lineid" y2="279" y1="186" x2="885" x1="795">
<img style="WIDTH: 90px; HEIGHT: 93px" height="21" alt="Image" src="../images//k03.jpg" width="25" name="imgElt11289447233738dIi15vNI1m6G" tag="img"></img></div>
<div id="imgElt11288263284216dIi15v" style="BORDER-RIGHT: 0px; BORDER-TOP: 0px; Z-INDEX: 1; LEFT: 660px; BORDER-LEFT: 0px; WIDTH: 147px; CURSOR: auto; BORDER-BOTTOM: 0px; POSITION: absolute; TOP: 1964px; HEIGHT: 22px" lineid="lineid" y2="1986" y1="1964" x2="807" x1="660">
<img style="WIDTH: 147px; HEIGHT: 22px" height="21" alt="Image" src="../images//k03.jpg" width="25" name="imgElt11288263284216dIi15vNI1m6G" tag="img"></img></div>
<div id="txtElt11288262779851dIi15v" style="BORDER-RIGHT: 0px; BORDER-TOP: 0px; Z-INDEX: 2872735; LEFT: 250px; BORDER-LEFT: 0px; WIDTH: 95px; CURSOR: auto; BORDER-BOTTOM: 0px; POSITION: absolute; TOP: 1514px; HEIGHT: 18px" selectedindex="0" pos_rel="false" lineid="lineid" y2="1532" y1="1514" x2="345" x1="250" tag="div">
<p><strong><font face="arial,helvetica,sans-serif" size="2">Course Name</font></strong></p>
</div>
<div id="txtElt11288262309675dIi15v" style="BORDER-RIGHT: 0px; BORDER-TOP: 0px; Z-INDEX: 1565881; LEFT: 40px; BORDER-LEFT: 0px; WIDTH: 430px; CURSOR: auto; BORDER-BOTTOM: 0px; POSITION: absolute; TOP: 1464px; HEIGHT: 34px" selectedindex="0" pos_rel="false" lineid="lineid" y2="1498" y1="1464" x2="470" x1="40" tag="div">
<p><strong>
<font face="arial,helvetica,sans-serif" size="2" tag="font">16. Please
write below the Course Name in order of preference.</font></strong></p>
<p tag="p"><strong><font face="Arial" size="2" tag="font"> (Please
see the "Instructions to Candidate" for list of courses)</font></strong></p>
</div>
</div>
As can be seen, one div contains many divs. Now I want to create a CSS file that contains all the styling of this HTML page (it need not be identical). I have to write this in Java code, and I have the DOM object of this file available.
Basically, I want all the styles removed from the HTML and kept in a CSS file. For example, for the div with id = imgElt11289447233738dIi15v the CSS will be:
#imgElt11289447233738dIi15v{BORDER-RIGHT: 0px; BORDER-TOP: 0px; Z-INDEX: 1; LEFT: 795px; BORDER-LEFT: 0px; WIDTH: 90px; CURSOR: auto; BORDER-BOTTOM: 0px; POSITION: absolute; TOP: 186px; HEIGHT: 93px}
I am done up to this part, but since I don't know how many levels of element hierarchy there will be, is there any way to do the same for all child elements as well?
I used the following code:
public static Document getStyleInCSSfile(Document aoDoc, String aoPathToWrite, String aoFileName) throws ApplicationException {
String loValue = null;
String loID = null;
String lsContent = "";
Element loRoot = aoDoc.getRootElement();
List loTempElementList = loRoot.getChildren();
int liCounter;
for (liCounter = 0; liCounter < loTempElementList.size(); liCounter++) {
Element loTemplateEle = (Element) loTempElementList.get(liCounter);
String loId=loTemplateEle.getAttribute("id").getValue();
loID = loTemplateEle.getAttributeValue("id");
if(null != loID)
{
loValue = loTemplateEle.getAttributeValue("style");
if(loValue!=null && loValue.trim().length()>0)
{
loTemplateEle.removeAttribute("style");
lsContent = lsContent.concat("#"+loID+"{"+loValue+"}\n");
}
}
}
SaveFormOnLocalUtil.writeToFile(aoPathToWrite,aoFileName,lsContent);
return aoDoc;
}
Edit: I learned that regular expressions may help, by getting a string from the SAX parser object and applying a regular expression to it. Any idea how to implement this?

Is it effective to define a style for every single tag?
If I were you, I'd check whether any other tag has the same style. If all elements with a given style have the same tag_name, I'd use the following:
tag_name{text-transform:uppercase;text-align:center;}
and every element with this tag name (unless its style is set some other way) would get this style.
If there are a lot of different tags with the same style:
.class_name{text-transform:uppercase;text-align:center;}
<tag class="class_name">content</tag>
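The class-based variant could be automated roughly like this: collect the style strings, give each distinct one a generated class name, and emit one rule per class. StyleGrouper and the c0, c1, ... naming scheme are assumptions made for the sketch, not anything from the question:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class StyleGrouper {

    // Maps each distinct style string to a generated class name (c0, c1, ...).
    public static Map<String, String> classNames(Iterable<String> styles) {
        Map<String, String> names = new LinkedHashMap<>();
        for (String style : styles) {
            String key = style.trim();
            if (!names.containsKey(key)) {
                names.put(key, "c" + names.size());
            }
        }
        return names;
    }

    // Renders one CSS rule per distinct style, in first-seen order.
    public static String toCss(Map<String, String> names) {
        StringBuilder css = new StringBuilder();
        for (Map.Entry<String, String> e : names.entrySet()) {
            css.append('.').append(e.getValue())
               .append('{').append(e.getKey()).append("}\n");
        }
        return css.toString();
    }

    public static void main(String[] args) {
        Map<String, String> names = classNames(Arrays.asList(
                "WIDTH: 90px; HEIGHT: 93px",
                "WIDTH: 147px; HEIGHT: 22px",
                "WIDTH: 90px; HEIGHT: 93px"));
        System.out.print(toCss(names));
    }
}
```

Each element would then get class="c0" (or whichever name its style maps to) in place of its style attribute.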

I think you should use SAX instead of DOM. With SAX you register a handler that is called every time the parser sees a new tag (together with its attributes). In this case, every time you see a "style" attribute, you extract its value to the CSS file.
Another approach is Digester from jakarta.apache.org. It uses SAX and allows XML configuration (see DigesterDigester) that maps your value objects directly to your XML document.
A completely different solution can be built with Unix shell commands like grep and sed. Which solution to prefer depends on your system requirements and how often you have to run this code. If it is a one-time transformation, use Unix shell scripting. If it must be robust and change the pages on the fly, use a Java solution.
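The SAX idea might look roughly like the sketch below, using only the JDK's built-in parser. It assumes the page is well-formed XML (as the sample in the question appears to be) and only collects the rules; it does not strip the style attributes from the document. StyleExtractor and extractCss are names invented for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StyleExtractor {

    // Collects a "#id{style}" rule for every element, at any depth, that has both attributes.
    public static String extractCss(String xhtml) {
        final StringBuilder css = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                String id = attrs.getValue("id");
                String style = attrs.getValue("style");
                if (id != null && style != null && !style.trim().isEmpty()) {
                    css.append('#').append(id).append('{').append(style.trim()).append("}\n");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Parsing fails if the input is not well-formed XML.
            throw new IllegalArgumentException("could not parse input", e);
        }
        return css.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<div><div id=\"a\" style=\"WIDTH: 90px\">"
                + "<img id=\"b\" style=\"HEIGHT: 93px\"></img></div><p>text</p></div>";
        System.out.print(extractCss(xhtml));
    }
}
```

Because the handler fires for every start tag, nesting depth does not matter, which addresses the "how many levels of hierarchy" concern without explicit recursion.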
