I have a problem displaying Cyrillic symbols. I have an HTML document containing Cyrillic symbols, and after conversion to PDF they are all displayed as ### instead of the actual characters. I'm using the library like this:
var document = Jsoup.parse(new ByteArrayInputStream(resultHtml), "UTF-8", "/");
ByteArrayOutputStream os = new ByteArrayOutputStream();
try (os) {
var temp = new W3CDom().fromJsoup(document);
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.toStream(os);
builder.useFont(new File("/resources/fonts/times.ttf"), "Times");
builder.withW3cDocument(temp, null);
builder.run();
}
return os;
The resultHtml is an HTML string and it is fine: with the iText7 library I got the PDF I wanted, with the symbols displayed normally. iText7 is not free, though; I mention it only to narrow down the possible causes, so I assume the problem is in how I use this library. I don't have any resources related to the HTML, which is why the baseUri arguments are / and null. The library gives me 2 warnings, but I don't think they are the cause, because it says it is ignoring the declarations.
com.openhtmltopdf.css-parse WARNING:: (null#inline_style_1) so-language is an unrecognized CSS property at line 21. Ignoring declaration.
com.openhtmltopdf.css-parse WARNING:: (null#inline_style_1) so-language is an unrecognized CSS property at line 32. Ignoring declaration.
I checked in the debugger: the document is okay, because I can see the formed HTML with the Cyrillic symbols intact, but temp shows as [#document:null]. I read that this doesn't mean the document is null, but maybe it's the problem? I tried different charsets such as CP1251 and CP1252, but they give strange symbols too. At first I tried all the charsets without the font declaration, because the only font in use is Times New Roman and I thought it was the default; then I added the font to resources and declared it in code, but it didn't help. I'm using version 1.0.10 of the library and version 1.14.3 of jsoup.
I am trying to parse the XML below, fetched from a database, in a Java method, but I am getting an error.
Code used to parse the XML
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads that it's because of some special characters in the XML.
How do I fix this issue?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.
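As a rough sketch of "reading with the correct encoding", assuming you already know what the stored bytes are encoded in: wrap the bytes in a Reader with that charset, so the parser never has to guess. The class name KnownEncodingXml and the sample element are made up for illustration, and ISO-8859-1 is only an assumed encoding here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class KnownEncodingXml {
    // Parses XML bytes whose encoding is known out of band (hypothetical helper).
    static Document parse(byte[] xmlBytes, Charset charset) {
        // Decoding to characters first means the parser does not fall back to UTF-8.
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset)) {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML as " + charset, e);
        }
    }

    public static void main(String[] args) {
        // A pound sign encoded as the single byte 0xA3, which is not valid UTF-8.
        byte[] latin1 = "<Offer>Blue \u00A344.99</Offer>".getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

If the bytes turn out to be Windows-1252 or something else, only the Charset argument needs to change.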
Open the XML in Notepad.
Make sure you don't have extra space at the beginning or end of the document.
Select File -> Save As.
Select Save as type -> All files.
Enter the file name as abcd.xml.
Select Encoding -> UTF-8, then click Save.
Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If the data is in an encoding other than UTF-8, just replace "UTF-8" with the correct encoding name.
I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
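When the XML is already a String, another option (a sketch, not necessarily what the poster above did) is to skip the byte conversion entirely and hand the parser a StringReader, so no platform default charset is involved. StringXml is a made-up class name; the sample element is taken from the question's data.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StringXml {
    // Parses XML held in a String without going through getBytes() (hypothetical helper).
    static Document parse(String xml) {
        try {
            // A character stream needs no decoding, so the "£" survives as-is.
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML string", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Primary_Offer_Name action='del'>MyMobile Blue \u00A344.99 [12 month term]</Primary_Offer_Name>";
        System.out.println(parse(xml).getDocumentElement().getTextContent());
    }
}
```

In the question's code this would mean replacing new InputSource(new ByteArrayInputStream(cond.getBytes())) with new InputSource(new StringReader(cond)).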
I had the same problem in my JSF application, which had a comment line containing some special characters in an XHTML page. When I compared it with the previous version in Eclipse, it had this comment:
//Some � special characters found
Removing those characters made the page load fine. This error is mostly related to XML files, so compare the failing file with a working version.
I had this problem even though the file was in UTF-8; somehow one character had crept in that was not encoded in UTF-8. To solve the problem I did what is described in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
If something in the file is not valid UTF-8, iconv will report where the offending bytes are so that you can find them.
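If you would rather do the same check from Java, here is a minimal sketch using a strict CharsetDecoder. Utf8Check and firstInvalidOffset are names I made up; the method returns a byte offset rather than a line number.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns the offset of the first byte sequence that is not valid UTF-8,
    // or -1 if the whole array decodes cleanly (hypothetical helper).
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never decodes to more chars than it has bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(data.length);
        CoderResult result = decoder.decode(in, out, true);
        // On error the input position is left at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        byte[] bad = {'a', 'b', (byte) 0xA3}; // 0xA3 is "£" in ISO-8859-1, invalid on its own in UTF-8
        System.out.println(firstInvalidOffset(bad));
    }
}
```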
I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".
This error occurs when you try to load a Jasper report file with the .jasper extension.
For example:
"c://reports//EmployeeReport.jasper"
Instead, you should load the Jasper report file with the .jrxml extension.
For example:
"c://reports//EmployeeReport.jrxml"
See problem screenshot: https://i.stack.imgur.com/D5SzR.png
See solution screenshot: https://i.stack.imgur.com/VeQb9.png
I had a similar problem.
I had saved some XML in a file, and reading it into a DOM document failed because of a special character. I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.
I ran into the same problem, and after a long investigation of my XML file I found the cause: there were a few unescaped characters like « ».
If, like me, you understand character encoding principles, have read Joel's article (which, amusingly, contains wrongly encoded characters itself) and still can't figure out what is going on (spoiler alert: I'm a Mac user), your solution may be as simple as removing your local repo and cloning it again.
My code base had not changed since the last time it ran OK, so UTF errors made no sense, given that our build system never complained about them... until I remembered that a few days ago I accidentally unplugged my computer with IntelliJ IDEA and the whole stack (Java/Tomcat/Hibernate) running.
My Mac did a brilliant job of pretending nothing had happened and I carried on with business as usual, but the underlying file system was left corrupted somehow. I wasted a whole day trying to figure this one out. I hope it helps somebody.
I had the same issue. My problem was a missing -Dfile.encoding=UTF8 argument in JAVA_OPTIONS in the startWebLogic.cmd file of the WebLogic server.
You have a dependency that needs to be removed, such as the following:
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'
This error surprised me in production...
The error occurs because the character encoding is wrong, so the best solution is to implement a way to auto-detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...
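One caveat on the snippet above: InputSource.getEncoding() only returns an encoding that was set explicitly, so for a plain byte stream it is null. As far as I know, a JAXP parser given the raw byte stream works out the encoding itself from the byte order mark and the XML declaration, so a sketch like the following may be all the detection you need. DeclaredEncodingXml is a made-up name, and the sample mimics the utf-16 input above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingXml {
    // Lets the parser pick the encoding from the BOM / XML declaration (hypothetical helper).
    static Document parse(byte[] xmlBytes) {
        // Raw bytes go straight to the parser; no Reader and no charset guess on our side.
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML bytes", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?><rss version=\"2.0\"><channel/></rss>";
        // Java's UTF_16 encoder writes a big-endian byte order mark first.
        byte[] utf16 = xml.getBytes(StandardCharsets.UTF_16);
        System.out.println(parse(utf16).getDocumentElement().getNodeName());
    }
}
```

This only helps when the declaration is truthful; if the bytes do not match the declared encoding, the parser fails in the same way as before.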
I use the Alchemy API in my program to extract keywords and relations from a URL.
When making these API calls I get the following error:
'java.io.IOException: Error making API call: cannot-retrieve:downstream-http-error:404.
at com.alchemyapi.api.AlchemyAPI.doRequest(AlchemyAPI.java:960)
at com.alchemyapi.api.AlchemyAPI.GET(AlchemyAPI.java:914)
at com.alchemyapi.api.AlchemyAPI.URLGetRankedKeywords(AlchemyAPI.java:234)
at com.alchemyapi.api.AlchemyAPI.URLGetRankedKeywords(AlchemyAPI.java:224)
at innointel.feature1.Article.alchemyCall(Article.java:477)'
Then I found that "http://venturebeat.com/2014/10/22/microsoft-and-ibm-partner-to-bring-enterprise-software-to-their-respective-cloud-platforms/" was the URL causing the error. I called the relations API directly, passing the URL as a literal, as follows:
Document doc = alchemyObj.URLGetRelations("http://venturebeat.com/2014/10/22/microsoft-and-ibm-partner-to-bring-enterprise-software-to-their-respective-cloud-platforms/");
Now there is no error. What is actually happening here?
I found on some websites that "cannot-retrieve:downstream-http-error:404" is caused by an invalid URL passed as the argument.
Out of the 50 URLs I tested, 7 produce the error and the rest work fine. When I copy each of those 7 URL strings and pass it as a literal argument, it works fine too.
(The URLs are read from an Excel document using the POI API.)
Thanks in advance
As you said, the exception
'java.io.IOException: Error making API call: cannot-retrieve:downstream-http-error:404'
is caused by a wrong URL argument in the function call (i.e. URLGetRankedKeywords()).
Since the URL is read from the Excel document, a '\r' character may sometimes be present at the end of the cell. If present, it makes the URL invalid.
What you can do is remove all '\r' characters from the URL before you pass it to the API call, i.e.:
url = url.replaceAll("\r", "");
Document doc = alchemyObj.URLGetRelations(url);
This might work; it worked for me.
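A slightly broader variant of the same idea, as a sketch: strip line feeds as well as carriage returns and trim surrounding whitespace before the URL goes anywhere near the API. CellUrl.clean is a made-up helper, and the example.com URL is only a placeholder.

```java
class CellUrl {
    // Removes CR/LF characters and surrounding whitespace from a spreadsheet cell value
    // (hypothetical helper; apply it to whatever string the cell lookup returns).
    static String clean(String raw) {
        return raw.replaceAll("[\\r\\n]", "").trim();
    }

    public static void main(String[] args) {
        String fromCell = " http://example.com/some-article/\r\n";
        // Brackets make any leftover whitespace visible in the output.
        System.out.println("[" + clean(fromCell) + "]");
    }
}
```

Printing the value between brackets like this is also a quick way to confirm whether the failing URLs really carry an invisible trailing character.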
I need to get the source code of a particular URL using Java code. I was able to get the source for a UTF-8 encoded web page, but not for one encoded in ISO-8859-1. Is it possible to get the source code of a website that uses ISO-8859-1 from a Java program? Please help.
If you are reading with the following method, you need to specify the character set explicitly:
URL url = new URL(URL_TO_READ);
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream(),"ISO-8859-1" ));
However, if your requirement involves even a little parsing, I would suggest using jsoup, which reads the character set from the server's response. You can also set the charset explicitly.
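If you want to stay with plain java.net instead of hard-coding "ISO-8859-1", one possible approach is to take the charset from the Content-Type response header. The sketch below only covers the header parsing; ContentTypeCharset and fromContentType are made-up names, and the fallback charset is your own choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    // Extracts the charset parameter from a Content-Type header value,
    // returning the fallback when it is missing or unknown (hypothetical helper).
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        // 0xE9 is "é" in ISO-8859-1.
        System.out.println(new String(new byte[] {(byte) 0xE9}, cs));
    }
}
```

With a URLConnection, the header value would come from getContentType(), and the resulting Charset would be passed to the InputStreamReader. Pages that declare their charset only in a meta tag are not covered by this.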
I have a JSF app with international users, so form inputs can contain non-Western strings such as kanji and Chinese. If I hit my URL with ..?q=東日本大, the output on the page is correct and the q input in my form is populated fine. But if I enter that same string into my form and submit, my app redirects back to itself after constructing the URL with the populated parameters (this seems redundant, but it is required by a 3rd-party integration), and the redirect does not encode the string properly. I have
url = new String(url.getBytes("ISO-8859-1"), "UTF-8");
response.sendRedirect(url);
But the redirect URL ends up being q=????. I've played around with various encoding names in the String constructor (switching ISO-8859-1 and UTF-8 around just gave me a bunch of gibberish in the URL), but none of them gets me q=東日本大. Any ideas as to what I need to do to get q=東日本大 populated in the redirect properly? Thanks.
How are you making your url? URIs can't directly have non-ASCII characters in; they have to be turned into bytes (using a particular encoding) and then %-encoded.
URLEncoder.encode should be given an encoding argument, to ensure this is the right encoding. Otherwise you get the default encoding, which is probably wrong and always to be avoided.
String q= "\u6771\u65e5\u672c\u5927"; // 東日本大
String url= "http://example.com/query?q="+URLEncoder.encode(q, "utf-8");
// http://example.com/query?q=%E6%9D%B1%E6%97%A5%E6%9C%AC%E5%A4%A7
response.sendRedirect(url);
This URI will display as the IRI ‘http://example.com/query?q=東日本大’ in the browser address bar.
Make sure you're serving your pages as UTF-8 (using Content-Type header/meta) and interpreting query string input as UTF-8 (server-specific; see this faq for Tomcat.)
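To see why the original round trip produces q=????, here is a small self-contained sketch comparing it with URLEncoder. RedirectQuery and its method names are invented for the demonstration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class RedirectQuery {
    // Reproduces the question's conversion: ISO-8859-1 cannot represent these
    // characters, so each one is replaced by '?' before UTF-8 decoding even starts.
    static String lossyRoundTrip(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Percent-encodes a query parameter value as UTF-8.
    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String q = "\u6771\u65e5\u672c\u5927"; // the four characters from the question
        System.out.println(lossyRoundTrip(q)); // ????
        System.out.println(encodeParam(q));    // %E6%9D%B1%E6%97%A5%E6%9C%AC%E5%A4%A7
    }
}
```

The information is lost in getBytes("ISO-8859-1") itself, which is why no combination of charset names in the String constructor can bring the characters back.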
Try
response.setContentType("text/html; charset=UTF-16");
response.setCharacterEncoding("utf-16");