I have a program that supports internationalisation. Some entries have input provided in Japanese characters. When I export such an entry to XML using JAXB, the Japanese characters look fine in the file; the proper characters are written out. The problem occurs when I unmarshal that XML file to get the data back as a Java object: I do not get the proper unmarshalled value of the Japanese characters.
Here is my marshalling code:
OutputStreamWriter outputWriter = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
JAXB.marshal(xmlobj, outputWriter);
Unmarshalling code:
InputStreamReader inputReader = new InputStreamReader(xml, "UTF-8");
xmlobj = JAXB.unmarshal(inputReader, <JAVA_CLASS_TO_UNMARSHAL>);
The text I am marshalling-unmarshalling is: 説明_1
It displays correctly when I fetch this record and show it in the browser, but the value obtained from JAXB unmarshalling is displayed incorrectly. After converting it to HTML-compatible code I got the value &#35500;&#26126;_1, which is actually the correct conversion of the Japanese characters, and it should render as the proper characters in the browser, but it does not do so. The browser displays the literal HTML codes &#35500;&#26126;_1.
Any idea where I am going wrong?
If the HTML contains
<html>
<body>
説明_1<br>
</body>
</html>
then a good browser like Firefox (I have 31.0) should display 説明_1. Can you add the HTML section to your question?
If your browser isn't able to display these characters, you should see replacement glyphs (empty boxes) instead.
You report that you see &#35500;&#26126;_1, which is possible if your HTML text contains
&#35500;&#26126;_1<br>
which would mean that the transformation to HTML hasn't worked correctly.
Once more: check your HTML code, and how it is produced from the XML.
Try using UTF-8 in your HTML Header. Note that just changing the charset in the header won't convert the content — you need to make sure that the content is actually UTF-8 as well.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
The comment by Wundwin Born solved the issue: I forgot to unescape the string.
Here is the code snippet.
org.apache.commons.lang.StringEscapeUtils.unescapeHtml(xmlString);
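For illustration, here is a stdlib-only sketch of what unescapeHtml does for the decimal numeric character references involved in this case (commons-lang's version also handles named and hexadecimal entities; this minimal variant handles only &#NNNN; forms):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeSketch {
    // Replace every decimal numeric character reference (&#NNNN;)
    // with the character it encodes
    static String unescapeNumeric(String s) {
        Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            m.appendReplacement(sb,
                    Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescapeNumeric("&#35500;&#26126;_1")); // prints 説明_1
    }
}
```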
Related
I have a String that contains some HTML tags and comes from a database. I want to write it to a PDF file with the same styling that the HTML tags describe. I tried to use XMLWorkerHelper like this:
String html = "What is the equation of the line passing through the "
        + "point (2,-3) and making an angle of -45<sup>2</sup> with the "
        + "positive X-axis?";
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
        new StringReader(html));
but it only renders the data that is inside the HTML tag (in this case only the 2); the rest of the string it simply ignores. But I want the entire String with the HTML formatting.
With HTMLWorker it works perfectly, but that is deprecated, so please let me know how to achieve this.
I am using the iText 5 lib.
I have a sample code as below.
String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as
This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup
But I want the output as
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do I parse it so that I get this output? Or is there another way to do so in Java?
You can disable the pretty printing of your document to get the output the way you want it. But you also have to change .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();
The HTML specification requires that multiple whitespace characters are collapsed into a single whitespace. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
I don't think you can change how the parser works. You could add a preprocessing step where you replace multiple whitespaces with non-breaking spaces (&nbsp;), which will not collapse. The side effect, though, would of course be that those would be, well, non-breaking (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).
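A minimal sketch of that preprocessing step (plain JDK, no jsoup needed; it assumes you want every whitespace character, including newlines, preserved as a non-breaking space):

```java
public class NbspPreprocessor {
    // Replace each whitespace character with the &nbsp; entity so an
    // HTML parser will not collapse runs of whitespace
    static String preserveWhitespace(String html) {
        return html.replaceAll("\\s", "&nbsp;");
    }

    public static void main(String[] args) {
        System.out.println(preserveWhitespace("line one\nline  two"));
        // prints line&nbsp;one&nbsp;line&nbsp;&nbsp;two
    }
}
```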
I am using the following code to generate a PDF file of the HTML Report
String url = new File("Test.html").toURI().toURL().toString();
OutputStream os = new FileOutputStream("Test.pdf");
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(url);
renderer.layout();
renderer.createPDF(os);
os.close();
I was able to use it on sample HTML files to convert to PDF. But in my real usage, the HTML content contains various special symbols, like &, < and >, that can't be parsed as XML.
I tried using CDATA while generating the HTML itself, but later found that the text around the CDATA is not visible in the HTML.
Does anyone have a solution for this?
Have you tried printing to PDF from the browser? Google "primo pdf" for a program that will let you do it.
I don't know if this will help you, but you can use StringEscapeUtils from apache-commons. It has methods to escape and unescape HTML (you may use them to pre-process your HTML before PDF generation).
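To show the idea, here is a stdlib-only sketch that escapes just the three XML-significant characters mentioned in the question (StringEscapeUtils.escapeHtml in commons-lang additionally handles named entities and non-ASCII characters):

```java
public class XmlEscaper {
    // Escape the characters that break XML parsing;
    // & must be replaced first or it would re-escape the other entities
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b && b > c"));
        // prints a &lt; b &amp;&amp; b &gt; c
    }
}
```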
I have an application in Java.
This application contains one text box and a button.
Now I want to save Gujarati (other language) data into the database in the click event of the button.
How is this possible? Actually, I tried this, but my string comes back in some other format.
So I don't know how to store Gujarati data in a string.
This works in Java as long as source code is in a Unicode-defined encoding, such as UTF-8 or UTF-16:
String ગઉજ = "ઋઊઘ";
That part solved, you need to specify where exactly your problem lies.
Java works in Unicode. Gujarati characters have Unicode values as shown here.
You can directly store them in a string. However, if you can't directly take Gujarati input, you can use the Character class like this:
int c = 0x0A82;
String s = Character.toString((char)c);
//s is ં
And so on
There are some changes that you should make in your website.
In the JSP file, change
<%@ page language="java" contentType="text/html; charset=UTF-16" pageEncoding="UTF-16"%>
and in the .properties file
save your text in Unicode-escaped format,
like this (here the Unicode characters are Gujarati):
global.Name = \u0AA8\u0ABE\u0AAE
This will surely work, as it worked for me in Struts 2.
output will be :
નામ
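To illustrate, a small sketch showing that those \u escapes really decode to the Gujarati word (the code points are from the Unicode Gujarati block):

```java
public class GujaratiDemo {
    public static void main(String[] args) {
        // \u0AA8 = ન (NA), \u0ABE = ા (vowel sign AA), \u0AAE = મ (MA)
        String name = "\u0AA8\u0ABE\u0AAE";
        System.out.println(name); // prints નામ
    }
}
```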
I have a text file with WINDOWS-1252 characters like ø and ß. The file is uploaded via form submit to a servlet, where it's parsed with opencsv and returned as a List object to a JSP page, where it's displayed.
The UTF-8 chars are displayed as ? and I'm trying to figure out where along the way the encoding might have gone wrong.
I've tried a bunch of stuff:
my page has the tag <%@page contentType="text/html" pageEncoding="WINDOWS-1252"%>
file input is encoded - new InputStreamReader(new FileInputStream(file), "WINDOWS-1252")
every string is encoded - s = new String(s.getBytes("WINDOWS-1252"));
Where else can the encoding fail? Any ideas?
Some troubleshooting suggestions:
Debug print or otherwise examine the text as hex at various phases, and verify that encoding really is what you expect it to be.
Make sure there is no BOM (Byte Order Marker), and see this question and links in it if there is and you don't have an easy way to get rid of it: Reading UTF-8 - BOM marker
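A quick sketch of the hex-examination step: print the bytes of a known character under the candidate encodings and compare them against what is actually in your file (the byte values below follow from the Windows-1252 and UTF-8 encodings of ø):

```java
public class HexInspect {
    // Render a byte array as space-separated uppercase hex
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String s = "ø";
        System.out.println("windows-1252: " + toHex(s.getBytes("windows-1252"))); // F8
        System.out.println("UTF-8:        " + toHex(s.getBytes("UTF-8")));        // C3 B8
    }
}
```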
OK, the problem is fixed.
The first problem was that it wasn't a UTF-8 file at all but a WINDOWS-1252 one. I determined that using the juniversalchardet lib (very helpful and easy to use).
Then I had to make sure that I'm reading the file with the right charset by using an InputStreamReader:
new InputStreamReader(new FileInputStream(file), "WINDOWS-1252")
Then I just had to make sure that I'm displaying it with the right charset in the JSP file using the tag <%@page contentType="text/html" pageEncoding="WINDOWS-1252"%>
That's pretty much it:
(1) determine charset
(2) make sure you're reading the file right
(3) make sure you display it right
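The three steps above can be sketched as a round trip (self-contained; in the real app the bytes would come from the uploaded file rather than an in-memory array):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class Windows1252RoundTrip {
    public static void main(String[] args) throws Exception {
        // (1) charset determined: the data is Windows-1252 encoded
        byte[] fileBytes = "ø and ß".getBytes("windows-1252");

        // (2) read it with the matching charset
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(fileBytes), "windows-1252"));
        String line = reader.readLine();

        // (3) display it; the JSP page must declare the same charset
        System.out.println(line); // prints ø and ß
    }
}
```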