OpenHtmlToPdf cyrillic symbols displaying

OpenHtmlToPdf cyrillic symbols displaying - java

I have a problem displaying Cyrillic symbols. I have an HTML containing Cyrillic symbols. The problem is that after converting they all displaying like ### instead of symbols. I'm using the library like this:
var document = Jsoup.parse(new ByteArrayInputStream(resultHtml), "UTF-8", "/");
ByteArrayOutputStream os = new ByteArrayOutputStream();
try (os) {
var temp = new W3CDom().fromJsoup(document);
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.toStream(os);
builder.useFont(new File("/resources/fonts/times.ttf"), "Times");
builder.withW3cDocument(temp, null);
builder.run();
}
return os;
The resultHtml is a HTML string and it's okay, because using library iText7 I got the result I wanted: I got PDF with normal symbols, but the problem is that it's not free, I'm saying this only to cut the area of possible problems, so I assume the problem is in how I use the library. I don't really have any resources related to html, that's why it's baseUri is / and null. Library gives me 2 warnings but I don't think the problem is because of that because it says it's ignoring it.
com.openhtmltopdf.css-parse WARNING:: (null#inline_style_1) so-language is an unrecognized CSS property at line 21. Ignoring declaration.
com.openhtmltopdf.css-parse WARNING:: (null#inline_style_1) so-language is an unrecognized CSS property at line 32. Ignoring declaration.
I checked in the debug, I can see the document is okay because I can see the formed HTML with cyrillic symbols normally, but the temp is becoming [#document:null]. I read that it doesn't mean the document is null, but maybe it's the problem? I tried different charsets like CP1251, CP1252 but they're giving strange symbols too. At first I tried all charsets without the font declaration, because the only font in use is TimesNewRoman and I think it's default, but then added it in resources and in code declaration, but it didn't help. I'm using 1.0.10 version of the library and 1.14.3 version of jsoup.

Related

How to handle (object replacement character) in URL

Using Jsoup to scrape URLS and one of the URLS I keep getting has this  symbol in it. I have tried decoding the URL:
url = URLDecoder.decode(url, "UTF-8" );
but it still remains in the code looking like this:
I cant find much online about this other than it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."
But if this is the case I should be able to print the symbol if it is plain text but when I run
System.out.println("");
I get the following complication error:
and it reverts back to the last save.
Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/
NOTE: If you decode the url then compare it to the decoded url it comes back as not the same e.g.:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
System.out.println("The same");
}else {
System.out.println("Not the same");
}

That's not a compilation error. That's the eclipse code editor telling you it can't save the source code to a file, because you have told it to save the file in a cp1252 encoding, but that encoding can't express a .
Put differently, your development environment is currently configured to store source code in the cp1252 encoding, which doesn't support the character you want, so you either configure your development environment to store source code using a more flexible encoding (such as UTF-8 the error message suggests), or avoid having that character in your source code, for instance by using its unicode escape sequence instead:
System.out.println("\ufffc");
Note that as far as the Java language and runtime are concerned,  is a character like any other, so there may not be a particular need to "handle" it. Also, I am unsure why you'd expect URLDecoder to do anything if the URL hasn't been URL-encoded to begin with.

"ef bf bc" is a 3 bytes UTF-8 character so as the error says, there's no representation for that character in "CP1252" Windows page encoding.
An option could be to replace that percent encoding sequence with an ascii representation to make the filename for saving:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/".replace("%ef%bf%bc", "-xEFxBFxBC"), "UTF-8");
url ==> "https://www.breightgroup.com/job/hse-advisor-emb ... contract-roles-xEFxBFxBC/"
Another option using CharsetDecoder
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
CharsetDecoder decoder = Charset.forName("CP1252").newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE);
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
ByteBuffer buffer = ByteBuffer.wrap(urlDec.getBytes(Charset.forName("UTF-8")));
decoder.decode(buffer).toString();
Result
"https://www.breightgroup.com/job/hse-advisor-embedded-contract-rolesï¿¼/"

I found the issue resolved by just replacing URLs with this symbol because there are other URLs with Unicode symbols that were invisible that couldnt be converted ect..
So I just compared the urls to the following regex if it returns false then I just bypass it. Hope this helps someone out:
boolean newURL = url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");

Invalid byte 1 of 1-byte UTF-8 sequence: RestTemplate [duplicate]

I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?

How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.

Open the xml in notepad
Make sure you dont have extra space at the beginning and end of the document.
Select File -> Save As
select save as type -> All files
Enter file name as abcd.xml
select Encoding - UTF-8 -> Click Save

Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If it's anything else than UTF-8, just change the encoding part for the good one.

I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.

I had the same problem in my JSF application which was having a comment line containing some special characters in the XMHTL page. When I compared the previous version in my eclipse it had a comment,
//Some �  special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.

I had this problem, but the file was in UTF-8, it was just that somehow on character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8 it will give you the line and row numbers so that you can find it.

I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".

This error comes when you are trying to load jasper report file with the extension .jasper
For Example
c://reports//EmployeeReport.jasper"
While you should load jasper report file with the extension .jrxml
For Example
c://reports//EmployeeReport.jrxml"
[See Problem Screenshot ][1] [1]: https://i.stack.imgur.com/D5SzR.png
[See Solution Screenshot][2] [2]: https://i.stack.imgur.com/VeQb9.png

I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.

I have met the same problem and after long investigation of my XML file I found the problem: there was few unescaped characters like « ».

Those like me who understand character encoding principles, also read Joel's article which is funny as it contains wrong characters anyway and still can't figure out what the heck (spoiler alert, I'm Mac user) then your solution can be as simple as removing your local repo and clone it again.
My code base did not change since the last time it was running OK so it made no sense to have UTF errors given the fact that our build system never complained about it....till I remembered that I accidentally unplugged my computer few days ago with IntelliJ Idea and the whole thing running (Java/Tomcat/Hibernate)
My Mac did a brilliant job as pretending nothing happened and I carried on business as usual but the underlying file system was left corrupted somehow. Wasted the whole day trying to figure this one out. I hope it helps somebody.

I had the same issue. My problem was it was missing “-Dfile.encoding=UTF8” argument under the JAVA_OPTION in statWeblogic.cmd file in WebLogic server.

You have a library that needs to be erased
Like the following library
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'

This error surprised me in production...
The error is because the char encoding is wrong, so the best solution is implement a way to auto detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...

U+FFFD is not available in this font's encoding: WinAnsiEncoding

I'm using PDFBox 2.0.1.
I try to dynamically add some (user provided) UTF8 text to the form fields and show the result to the user. Unfortunately either the pdf library is not capable of properly encoding special characters such as "äöü"... or I was not able find any useful documentation that could help me with this issue.
Can someone tell me what is wrong with the given code sample?
try (PDDocument document = PDDocument.load(pdfTemplate)) {
PDDocumentCatalog catalog = document.getDocumentCatalog();
PDAcroForm form = catalog.getAcroForm();
List<PDField> fields = form.getFields();
for (PDField field : fields) {
switch (field.getPartialName()) {
case "devices":
// Frontend (JS): userInput = btoa('Gerät')
String userInput = ...
String name = new String(Base64.getDecoder().decode(base64devices), "UTF-8");
field.setReadOnly(true);
break;
}
}
form.flatten(fields, true);
document.save(bos);
}
And here the stacktrace of the error:
java.lang.IllegalArgumentException: U+FFFD is not available in this font's encoding: WinAnsiEncoding
org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.encode(PDTrueTypeFont.java:368)
org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:286)
org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:315)
org.apache.pdfbox.pdmodel.interactive.form.PlainText$Paragraph.getLines(PlainText.java:169)
org.apache.pdfbox.pdmodel.interactive.form.PlainTextFormatter.format(PlainTextFormatter.java:182)
org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.insertGeneratedAppearance(AppearanceGeneratorHelper.java:373)
org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceContent(AppearanceGeneratorHelper.java:237)
org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceValue(AppearanceGeneratorHelper.java:144)
org.apache.pdfbox.pdmodel.interactive.form.PDTextField.constructAppearances(PDTextField.java:263)
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.refreshAppearances(PDAcroForm.java:324)
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.flatten(PDAcroForm.java:213)
my.application.service.PDFService.generatePDF(PDFService.java:201)
I also found those (related) issues on SO:
pdfbox: ... is not available in this font's encoding
But that does not help me choose the right encoding or how. IIRC Java uses UTF16 internally for character encoding why is the default not enough though?
Is that an issue of the PDF-document itself or the code I use to set it?
PdfBox encode symbol currency euro
Well its dynamic user input, so there are way to many things I would have to replace myself.
Thus, if the PDFBox people decided to fix the broken PDFBox method, this seemingly clean work-around code here would start to fail as it would then feed the fixed method broken input data.
Admittedly, I doubt they will fix this bug before 2.0.0 (and in 2.0.0 the fixed method has a different name), but one never knows...
Unfortunately I was not able to find this other setter method, but it might also be a different scope it does apply to.
EDIT
Updated example code to better represent the problem.

U+FFFD is used to replace an incoming character whose value is unknown or unrepresentable in Unicode compare the use of U+001A as a control character to indicate the substitute function (source).
That said it is likely that that character gets messed up somewhere. Maybe the encoding of the file is not UTF-8 and that's why the character is messed up.
As a general rule you should only write ASCII characters in the source code. You can still represent the whole Unicode range using the escaped form \uXXXX. In this case ä -> \u00E4.
-- UPDATE --
Apparently the problem is in how the user input get encoded/decoded from client/server side using the JS function btoa. A solution to this problem can be found at this link:
Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings

FreeMarker Not Able Display Chinese Character

First time use FreeMarker on JAVA project and stack on configure the chinese character.
I tried a lot of examples to fix the code like below, but it still not able to make it.
// Free-marker configuration object
Configuration conf = new Configuration();
conf.setTemplateLoader(new ClassTemplateLoader(getClass(), "/"));
conf.setLocale(Locale.CHINA);
conf.setDefaultEncoding("UTF-8");
// Load template from source folder
Template template = conf.getTemplate(templatePath);
template.setEncoding("UTF-8");
// Get Free-Marker output value
Writer output = new StringWriter();
template.process(input, output);
// Map Email Full Content
EmailNotification email = new EmailNotification();
email.setSubject(subject);
.......
Saw some example request to make changes on the freemarker.properties but i have no this file. I just import the .jar file and use it.
Kindly advise what should i do to make it display chinese character.

What exactly is the problem?
Anyway, cfg.setDefaultEncoding("UTF-8"); should be enough, assuming your template files are indeed in UTF-8. But, another place where you have to ensure proper encoding is when you convert the the template output back to "binary" from UNICODE text. So FreeMarker sends its output into a Writer, so everything is UNICODE so far, but then you will have an OutputStreamWriter or something like that, and that has to use charset (UTF-8 probably) that can encode Chinese characters.

You need to change your file encoding of your .ftl template files by saving over them in your IDE or notepad, and changing the encoding in the save dialog.
There should be an Encoding dropdown at the bottom of the save dialog.

UTF-8 character encoding in Java

I am having some problems getting some French text to convert to UTF8 so that it can be displayed properly, either in a console, text file or in a GUI element.
The original string is
HANDICAP╔ES
which is supposed to be
HANDICAPÉES
Here is a code snippet that shows how I am using the jackcess Database driver to read in the Acccess MDB file in an Eclipse/Linux environment.
Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator rowIter = table.iterator();
while (rowIter.hasNext()) {
Map<String, Object> row = this.rowIter.next();
// convert fields to UTF
Map<String, Object> rowUTF = new HashMap<String, Object>();
try {
for (String key : row.keySet()) {
Object o = row.get(key);
if (o != null) {
String valueCP850 = o.toString();
// String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
rowUTF.put(key, valueUTF8);
}
}
} catch (UnsupportedEncodingException e) {
System.err.println("Encoding exception: " + e);
}
}
In the code you'll see where I want to convert directly to UTF8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding type when using the jackcess driver.
Thanks,
Cam

New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.
Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that according to ISO-8859-1, resulting in the character É. The next line:
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");
...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings--they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.
And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:
System.out.println("HANDICAPÉES");
The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:
CHCP 1252
To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be problem for French.
As for writing to a file, just specify the desired encoding when you create the Writer:
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream("myFile.txt"), "UTF-8");

String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES
This shows the correct string value. This means that it was originally encoded/decoded with ISO-8859-1 and then incorrectly encoded with CP850 (originally CP1252 a.k.a. Windows ANSI as pointed in a comment is indeed also possible since the É has the same codepoint there as in ISO-8859-1).
Align your environment and binary pipelines to use all the one and same character encoding. You can't and shouldn't convert between them. You would risk losing information in the non-ASCII range that way.
Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:
Align your environment and binary pipelines to use all the one and same character encoding.
You can not and should not convert between them. You would risk losing information in the non-ASCII range that way.
Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
To fix the problem you need to choose character encoding X which you'd like to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read/write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.
If you're actually using the "command prompt" as user interface, then you're actually lost. It doesn't support UTF-8. As suggested in the comments and in the article linked in the comments, you need to create a Swing application instead of relying on the restricted command prompt environment.

You can specify encoding when establishing connection. This way was perfect and solve my encoding problem:
DatabaseImpl open = DatabaseImpl.open(new File("main.mdb"), true, null, Database.DEFAULT_AUTO_SYNC, java.nio.charset.Charset.availableCharsets().get("windows-1251"), null, null);
Table table = open.getTable("FolderInfo");

Using "ISO-8859-1" helped me deal with the French charactes.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.