Saving Chinese characters using Java HtmlEditorKit - java

I'm trying to save an HTMLDocument (saved with UTF-8 encoding) that contains the Chinese character 𠜎, using HTMLEditorKit in the following way:
try (OutputStreamWriter f = new OutputStreamWriter(fileOutputStream, "UTF-8")) {
    htmlEditorKit.write(f, htmlDocument, 0, htmlDocument.getLength());
} catch (BadLocationException e) {
    logger.error("Could not save", e);
}
In the output HTML document I get two 2-byte characters (&#55361;&#57102;, one numeric reference per UTF-16 surrogate) instead of one 4-byte character. Java can work out which symbol is meant by combining the two, but HTML can't.
Any suggestions on how to save it so that the HTML page is displayed correctly?
Here is the output HTML:
<html>
<head>
<meta content="text/html" charset="utf-8">
</head>
<body>
<p>&#55361;&#57102;</p>
</body>
</html>
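One possible workaround, sketched here under assumptions rather than taken from HTMLEditorKit itself, is to write the document into a StringWriter first and post-process the text before saving it: wherever a high-surrogate reference is immediately followed by a low-surrogate reference, replace the pair with a single reference to the combined code point. The class and method names below are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: merges "&#high;&#low;" surrogate pairs into one reference.
final class SurrogateReferenceMerger {
    // A decimal numeric character reference such as "&#55361;"
    private static final Pattern REF = Pattern.compile("&#(\\d{1,7});");

    static String mergeSurrogateReferences(String html) {
        Matcher m = REF.matcher(html);
        Matcher next = REF.matcher(html);
        StringBuilder out = new StringBuilder();
        int copied = 0; // everything before this index is already in 'out'
        while (m.find()) {
            int high = Integer.parseInt(m.group(1));
            if (high < 0xD800 || high > 0xDBFF) continue;   // not a high surrogate
            next.region(m.end(), html.length());
            if (!next.lookingAt()) continue;                 // nothing directly after it
            int low = Integer.parseInt(next.group(1));
            if (low < 0xDC00 || low > 0xDFFF) continue;     // not a low surrogate
            int codePoint = Character.toCodePoint((char) high, (char) low);
            out.append(html, copied, m.start())
               .append("&#").append(codePoint).append(';');
            copied = next.end();
        }
        return out.append(html, copied, html.length()).toString();
    }
}
```

For the character in the question, the pair &#55361;&#57102; becomes &#132878; (U+2070E), which browsers display as one character.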

Related

Replace custom tags in html with Java

Is there a Java library that can help me replace custom tags in HTML?
For example, here is a simple template:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div>
<p>$welcome_title</p>
<p>$email_body</p>
<p>$footer_text</p>
</div>
</body>
</html>
Can I replace these custom tags ($welcome_title, $email_body, $footer_text) with values from Java?
The idea is to have a template with tags that can be replaced at runtime with values from Java objects :)
Also, if there is a library for it, I'd like to generate a PDF document straight from the HTML.
Thanks :)
In the Java world you can use https://www.thymeleaf.org/ or https://freemarker.apache.org/
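For a template as small as the one in the question, a plain-Java substitution may be enough. The following is a minimal sketch, not Thymeleaf or FreeMarker code: the class name, the method names, and the rule that placeholders look like $name are assumptions made for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces $name placeholders with HTML-escaped values.
final class SimpleTemplate {
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Unknown placeholders are left untouched so they are easy to spot.
            String replacement = (value != null) ? escapeHtml(value) : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Escapes the characters that have special meaning in HTML text and attributes.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

A template engine becomes the better choice once the template needs loops, conditionals, or includes.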

Can't show certain UTF-8 characters in android webview

I know this problem has been documented elsewhere but the solutions don't seem to work for me. Other similar questions:
Android WebView with garbled UTF-8 characters.
Android WebView UTF-8 not showing
I'm essentially trying to show the minus-plus character (∓) in an Android WebView. I tested several other characters around the minus-plus character in the UTF-8 table, and some of them didn't work either.
Here is the Java I'm using:
final WebView w = (WebView) findViewById(R.id.webview1);
w.getSettings().setJavaScriptEnabled(true);
w.getSettings().setDefaultTextEncodingName("utf-8");
InputStream is;
try {
    is = getAssets().open("test5.html");
    int size = is.available();
    byte[] buffer = new byte[size];
    is.read(buffer);
    is.close();
    String str = new String(buffer);
    w.loadData(str, "text/html; charset=utf-8", "utf-8");
} catch (IOException e) {
    // Opening or reading the asset failed
    e.printStackTrace();
}
Here is the HTML, test5.html:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
char: ∓ <br/>
char: ∔ <br/>
char: ∕ <br/>
char: ∖ <br/>
char: ∗ <br/>
char: ∘ <br/>
</body>
The only characters that show up are "∕" and "∗". I've also tried
w.loadDataWithBaseURL("file:///doesnotmatter", str, "text/html", "utf-8", "");
with no success. I'm not too familiar with input streams, so I don't know if there's something wrong there. Please help, it's taken me a while =\
-Teneth
When you call loadData(), pass the encoding as uppercase "UTF-8", because it is case sensitive and that's the standard representation according to the documentation.
Use HTML character codes for those characters in your HTML file:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
char: &#8723;<br/>
char: &#8724;<br/>
char: &#8725;<br/>
char: &#8726;<br/>
char: &#8727;<br/>
char: &#8728;<br/>
</body>
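If the HTML is assembled in code rather than edited by hand, the same substitution can be done programmatically. This is a small sketch with hypothetical class and method names: it replaces every non-ASCII code point with a decimal numeric character reference and leaves ASCII, including the markup, untouched.

```java
// Hypothetical helper: turns non-ASCII characters into decimal HTML references.
final class HtmlReferenceEncoder {
    static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate by code point so characters outside the BMP yield one reference.
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

In the question's code this would be applied to str before passing it to loadData(), so the string handed to the WebView is pure ASCII and no longer depends on how the encoding argument is interpreted.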

Servlet doesn't parse uploaded file as UTF-8

I'm having trouble uploading a file and parsing it as UTF-8 strings. I use the following code:
protected void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
Part filePart = request.getPart("file");
InputStream filecontent = filePart.getInputStream();
// ...
}
And my webpage looks like this:
<%@ page language="java" contentType="text/html; charset=UTF-8"
pageEncoding="UTF-8"%>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<form action="UploadServlet" method="post" enctype="multipart/form-data">
<input type="file" name="file" />
<input type="submit" />
</form>
</body>
</html>
I found a great post about UTF-8 encoding in Java webapps, but unfortunately it didn't work for me. I still see random symbols in strings in the NetBeans debugger, and when I display them on a webpage most of them are displayed correctly, but some Cyrillic letters (я, с, Н, А) are replaced by '�?'
A file upload through an HTML form doesn't apply any character encoding. The file is transferred byte by byte, as is. See here under "multipart/form-data".
So if the original file on the client side is a text file in UTF-8, it is also UTF-8 on the server side.
You can then use an InputStreamReader to decode the bytes as UTF-8 text:
InputStreamReader reader = new InputStreamReader(filecontent, "UTF-8");
That's it.
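To make that concrete, here is a minimal sketch of reading the whole stream into a String. The class and method names are hypothetical, and it assumes the uploaded file really is UTF-8 text that fits in memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes an entire input stream as UTF-8 text.
final class Utf8StreamReader {
    static String readAsUtf8(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        // The reader decodes bytes to chars and handles multi-byte sequences
        // that straddle buffer boundaries; the stream is closed on exit.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        }
        return text.toString();
    }
}
```

In the servlet from the question it would be called as readAsUtf8(filePart.getInputStream()). Decoding through a Reader avoids converting fixed-size byte chunks to strings one at a time, which can split a multi-byte character in two.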
javax.servlet.http.Part, which you use in the very first line of your code, has a getContentType() method that tells you the content type of the uploaded form data. Nothing you have written so far constrains the uploaded data to any particular character set, so you need to determine the character set and handle it accordingly.

Java check - charset, encoding of html page - like browsers do

How can I check the real charset (encoding) of an HTML page?
For example, the page declares its charset as iso-8859-1, but the content is actually written in UTF-8:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
...
here is content with utf8
...
</html>
How can I check this? Is it possible to detect the charset of an HTML page with Java,
the way browsers do?
Thank you!
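One partial check for the case described, a page declared as iso-8859-1 but written in UTF-8, is to test whether the raw bytes are well-formed UTF-8. The sketch below is only that heuristic, with hypothetical names, and is not the detection algorithm browsers implement. Pure-ASCII input passes as well, since ASCII is valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation of raw page bytes.
final class Utf8Validator {
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

If the bytes contain non-ASCII data and still decode cleanly, the content is very likely UTF-8 whatever the meta tag says, because ISO-8859-1 text with accented letters rarely forms valid UTF-8 sequences.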

JEditorPane saves HTML using entities instead of diacritics

I have a file containing the Czech text "Běžný soubor" ("common file"), split across two lines:
<html>
<head>
<meta http-equiv="contet-type" content="text/html; charset=UTF-8"/>
</head>
<body>
<p>Běžný</p>
<p>soubor</p>
</body>
</html>
When I load this file into a JEditorPane using HTMLEditorKit and then save it (as if it had been edited), the underlying model (the HTML code) is changed to:
<html>
<head>
<meta http-equiv="contet-type" content="text/html; charset=UTF-8"/>
</head>
<body>
<p style="margin-top: 0">B&#283;&#382;n&#253;</p>
<p style="margin-top: 0">soubor</p>
</body>
</html>
Is there some way to get rid of the margins and entities? Do I have to override some methods of HTMLEditorKit?
PS: Is there another embeddable (and free) simple Java HTML (WYSIWYG-like) editor? I need it to handle some special tags from my own XML namespace. (Ideally HTML 4.0 compliant.)
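For the entities, one option that avoids overriding anything is to post-process the saved text. The following sketch uses hypothetical names and assumes the output uses decimal references as in the example above. It turns references to non-ASCII characters back into literal characters and leaves ASCII references alone, so escaped markup characters stay escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes decimal references for non-ASCII characters.
final class NumericReferenceDecoder {
    private static final Pattern REF = Pattern.compile("&#(\\d{1,7});");

    static String decodeNonAscii(String html) {
        Matcher m = REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group(1));
            // Keep ASCII and out-of-range references exactly as written.
            String replacement = (value >= 128 && Character.isValidCodePoint(value))
                    ? new String(Character.toChars(value))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The result then has to be written with a UTF-8 writer so that the restored characters match the charset declared in the meta tag.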
Please use NetBeans IDE 7.0.
It is a free download:
http://netbeans.org/downloads/
