99.9% of the pages in my application use UTF-8 encoding.
However, for a special use case on the client side, I need one of them to use Unicode (2 bytes per character).
To that end, the header of this page is:
<%@ page language="java" contentType="text/html; charset=unicode"%>
...<my content>...
This implementation works fine and does the job when the application runs on Tomcat and WebSphere. However, when it is deployed on WebLogic, I get the server error:
unsupported encoding: 'unicode': java.io.UnsupportedEncodingException: unicode
Does anyone know how I can force WebLogic to send pages in 'Unicode' encoding?
UTF-8 is Unicode. "Unicode" is not a character encoding of its own; it is a character mapping standard (a character set). Your problem lies somewhere else. Maybe you've had problems with GET request encoding, which is often overlooked. You may find this article useful for more background information and complete solutions on getting Unicode to work in a Java EE web application: Unicode - How to get the characters right?
Good luck.
By the way, "2 bytes per character" is characteristic of the bulk of the UTF-16 encoding: code points 0x0000 through 0xFFFF are represented in 2 bytes, whereas UTF-8 uses 1, 2 or 3 bytes over that same range. Maybe you just meant to use UTF-16 instead?
Unicode is not a charset, but there are charsets that can represent the characters of the Unicode system. You already know the UTF-8 charset, which encodes each character with 1, 2, 3 or 4 bytes, depending on the character's position in the system. It seems that you want the UTF-16 charset, which encodes each character with 2 or 4 bytes.
Note related to the answer provided by BalusC: here I use the word "charset" to mean "the character-set encoding part of the Content-Type MIME header". Strictly speaking, the Universal Character Set provided by Unicode is a character set, but the charset moniker does not strictly designate a character set.
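If UTF-16 is indeed what you need, the page directive would be something like the following (a sketch; exact behavior can vary by container):
<%@ page language="java" contentType="text/html; charset=UTF-16"%>
For what it's worth, "unicode" happens to be accepted as an alias for UTF-16 by some JDKs, which would explain why Tomcat and WebSphere tolerated it while WebLogic throws UnsupportedEncodingException.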
Related
I need to encode a URL component. The URL component can contain special characters like "?", "#", "=", and also Chinese characters.
Which of the character sets should I use: UTF-8, UTF-16 or UTF-32? And why?
I suppose you mean percent encoding here.
RFC 3986, section 2.5 is pretty clear about this (emphasis mine):
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
Therefore, this should be UTF-8.
Also, beware of URLEncoder.encode(); while it is repeatedly recommended, the fact is that it is not suitable for URI encoding. Quoting the javadoc of the class itself:
This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format
which is not what URI encoding uses. (In case you are wondering, application/x-www-form-urlencoded is what is used in HTTP POST data.) What you want to use instead is a URI template. See for instance here.
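To illustrate the difference, here is a minimal sketch (the example.com URL and query value are made up) comparing java.net.URI, whose multi-argument constructors apply RFC 3986 quoting and whose toASCIIString() percent-encodes non-ASCII characters as UTF-8 octets, with URLEncoder's form encoding:
import java.net.URI;
import java.net.URLEncoder;

public class UriVsFormEncoding {
    public static void main(String[] args) throws Exception {
        // The multi-argument constructor quotes illegal characters, and
        // toASCIIString() percent-encodes non-ASCII characters as UTF-8 octets
        URI uri = new URI("https", "example.com", "/search", "q=à", null);
        System.out.println(uri.toASCIIString()); // https://example.com/search?q=%C3%A0

        // URLEncoder produces application/x-www-form-urlencoded output:
        // note the '+' for space, which is form encoding, not URI encoding
        System.out.println(URLEncoder.encode("a b à", "UTF-8")); // a+b+%C3%A0
    }
}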
A reference from an HTML point of view.
The HTML4 specification, section Non-ASCII characters in URI attribute values, states (my emphasis):
We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:
1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
Similarly, the HTML5 specification's section "Selecting a form submission encoding" basically says that UTF-8 should be used if no accept-charset attribute is specified.
On the other hand, I found nothing that states UTF-8 must be used. In particular, some older software uses ISO-8859-1. For example, Apache Tomcat before version 8 has ISO-8859-1 as the default value of its URIEncoding setting.
UTF-8 is the default character encoding in HTML5, as it can represent virtually every symbol/character.
Go for UTF-8. You can also achieve this with
URLEncoder.encode(string, encoding)
In addition, you can refer to this blog, which tries to encode some Chinese characters like '维也纳恩斯特哈佩尔球场'.
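For instance, a quick sketch of what that call produces for a Chinese string:
import java.net.URLEncoder;

public class EncodeChinese {
    public static void main(String[] args) throws Exception {
        // Each character is percent-encoded as its UTF-8 octets
        System.out.println(URLEncoder.encode("维也纳", "UTF-8"));
        // prints %E7%BB%B4%E4%B9%9F%E7%BA%B3
    }
}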
Encode your URL to escape special characters. There are several websites that can do this for you.
E.g. http://www.url-encode-decode.com/
I am trying to make a Java application and a VS C++ application communicate and send different messages to each other using sockets. The only problem I have so far: I am absolutely lost in their encodings.
By default Java uses UTF-8, which as far as I am concerned is a Unicode charset. In my VS project the settings are set to Unicode. Yet for some reason, when I debug my code, I always see my strings encoded as CP1252 in memory.
Furthermore, if I try to use CP1252 in Java, it works fine for English letters, but whenever I try some Russian letters I get a 0x3F byte for every letter.
If on the other hand I use UTF-8 in Java, each English letter is 1 byte long but every Russian one is 2 bytes long. Isn't it a multibyte encoding?
Some docs on C++ say that std::string (char) uses the UTF-8 code page, and std::wstring (wchar_t) UTF-16. When I debug my application I see CP1252 encoding for both of them, though the wstring has empty bytes between the letters.
Could you please explain how encodings behave in both Java and C++, and how I should make my two apps communicate?
UTF-8 has a variable length per character. Common characters take less space because they use fewer bytes; less common characters take more space because they have to be encoded in more bytes. Since most of this was invented in the US, guess which characters are shorter and which are longer?
If you want sockets to work, you will have to get both sides to agree on the encoding. Otherwise, you are fighting a losing battle.
It's not true that Java does UTF-8 encoding internally. You can write your source code in UTF-8 and compile it with some weird signs in identifiers (sometimes really annoying).
The internal representation of strings in Java is UTF-16 (see What is the Java's internal represention for String? Modified UTF-8? UTF-16?).
Unicode is a character set; UTF-8 and UTF-16 are encodings of Unicode. For English (actually ASCII) characters, UTF-8 yields the same values as CP1252, while UTF-16 adds a zero byte. Since you want to use Russian (Cyrillic), you can use UTF-8, UTF-16 or CP1251, but both applications must agree on the encoding.
For example, if you agreed on UTF-8, the following will convert a Java String s to an array of bytes using UTF-8:
byte[] b = s.getBytes("UTF-8");
Then:
outputStream.write(b);
will send the data on the socket.
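On the receiving side, read the raw bytes and decode them with the same agreed-upon charset. A minimal sketch, where socket is assumed to be the connected java.net.Socket:
// Read raw bytes off the socket, then decode with the agreed charset
java.io.InputStream in = socket.getInputStream();
byte[] buf = new byte[1024];
int len = in.read(buf);
String received = new String(buf, 0, len, "UTF-8");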
We have a CLOB column in the DB. When we extract this CLOB and try to display it (as plain text, not HTML), it prints some junk characters on the HTML screen. The character, when streamed directly to a file, looks like ” (not the usual double quote on a regular keyboard).
One more observation:
System.out.println("”".getBytes()[0]);
prints -108.
Why should a character byte be in the negative range? And is there any way to display it correctly on an HTML screen?
Re: your final observation - Java bytes are always signed. To interpret them as unsigned, you can bitwise AND them with an int:
byte[] bytes = "”".getBytes("UTF-8");
for (byte b : bytes) {
    System.out.println(b & 0xFF);
}
which outputs:
226
128
157
Note that your string is actually three bytes long in UTF-8.
As pointed out in the comments, it depends on the encoding. For UTF-16 you get (the first two bytes, 254 and 255, being the byte order mark):
254
255
32
29
and for US-ASCII or ISO-8859-1 you get
63
which is a question-mark (i.e. "I dunno, some new-fangled character"). Note that:
The behavior of this method [getBytes()] when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
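For instance, a small sketch of how a CharsetEncoder reports the problem instead of silently substituting a question mark:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class StrictEncode {
    public static void main(String[] args) {
        // A freshly created encoder already REPORTs errors; set it explicitly for clarity
        CharsetEncoder encoder = Charset.forName("US-ASCII")
                .newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap("”"));
            System.out.println("encoded " + out.remaining() + " bytes");
        } catch (CharacterCodingException e) {
            // We end up here: ” has no US-ASCII representation
            System.out.println("cannot encode: " + e);
        }
    }
}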
I think it is better to print the character code, like this:
System.out.println((int) '”'); // result is 8221
This link can help explain this extraordinary double quote (including its HTML code).
To answer your question about displaying the character correctly in an HTML document, you need to do one of two things: either set the encoding of the document or entity-ize the non-ASCII characters.
To set the encoding you have two options.
1. Update your web server to send an appropriate charset argument in the Content-Type header. The correct header would be Content-Type: text/html; charset=UTF-8.
2. Add a <meta charset="UTF-8" /> tag to the head section of your page.
Keep in mind that option 1 will take precedence over option 2; i.e. if you are already setting an incorrect charset in the header, you can't override it with a meta tag.
The other option is to entity-ize the non-ASCII characters. For the quote character in your question you could use &rdquo; or &#8221; or &#x201D;. The first is a user-friendly named entity, the second specifies the Unicode code point of the character in decimal, and the third specifies the code point in hex. All are valid and all will work.
Generally, if you are going to entity-ize dynamic content out of a database that contains unknown characters, you're best off just using the code-point versions of the entities, since you can easily write a method to convert any character above 127 to its appropriate code point (see the sketch after the next paragraph).
One of the systems I currently work on actually ran into this issue: we took data from a UTF-8 source and had to serve HTML pages with no control over the Content-Type header. We ended up writing a custom Java Charset which could convert a stream of Java characters into an ASCII-encoded byte stream with all non-ASCII characters converted to entities, then wrapped the output stream in a Writer with that Charset and output everything as usual. There are a few gotchas in implementing a Charset correctly, but doing the encoding yourself is pretty straightforward; just be sure to handle surrogate pairs correctly.
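A stripped-down version of that conversion (without the custom Charset plumbing) might look like the following sketch; the class and method names are just for illustration:
public class Entities {
    // Sketch: replace every character above 127 with a decimal numeric entity.
    // Iterating by code point (not char) keeps surrogate pairs intact.
    public static String entityize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (cp > 127) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(entityize("a ” quote")); // a &#8221; quote
    }
}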
I have search on my site: we frame the query and send it in the request, and the response comes back from the vendor as JSON. The vendor crawls our site, captures the data, and sends the response. In our design we convert the JSON into Java objects using GSON. We use UTF-8 as the charset in the meta tag.
Sometimes the response contains Unicode escapes for special characters, depending on the request. The browser renders these escaped special characters in a strange way. How should I decode them?
For example, for the special character 'ndash' I see it encoded in the response as '\u2013'.
To clarify the difference between Unicode and a character encoding:
Unicode
is an abstract concept that aims to identify all characters (currently more than 110,000).
Character encoding
defines how a character can be represented by a sequence of bytes;
one such encoding is UTF-8, which uses 1 to 4 bytes to represent a Unicode character.
A Java String is always UTF-16 internally. Hence, when you construct a String from bytes you can use the following constructor:
new String(byte[], encoding)
The second argument should be the encoding the characters were in when the client sent them. If you don't explicitly specify an encoding, you get the platform default, which you can examine using Charset.defaultCharset().
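For example, a minimal sketch decoding vendor bytes with an explicit charset (the sample bytes here just stand in for whatever comes off the wire):
import java.nio.charset.StandardCharsets;

public class DecodeResponse {
    public static void main(String[] args) {
        // Stand-in for the raw bytes received from the vendor
        byte[] raw = "– an en dash".getBytes(StandardCharsets.UTF_8);
        // Decode with an explicit charset instead of the platform default
        String s = new String(raw, StandardCharsets.UTF_8);
        System.out.println(s);
    }
}
Note that '\u2013' inside a JSON string literal is a standard JSON escape; a conforming parser such as GSON already turns it into the actual en dash character, so if the browser still shows garbage, the page's declared charset is the usual suspect.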
You can manually set the default encoding as an argument when starting the JVM:
-Dfile.encoding=UTF-8
Although rarely needed, you can also employ CharsetDecoder/CharsetEncoder.
Currently I am using UTF-8 for URL encoding. I want to convert it to UTF-16.
How can I achieve this?
When encoding Unicode characters in URLs, it's necessary to encode them in such a fashion that all URL parsers and consumers can understand your URLs.
To that end, when URLs were extended by RFCs in the wake of Unicode and its related standards and tools, it was decided that the encoding to employ for percent-escaped characters was to be UTF-8, as this meant that established ASCII escapes would Just Work™.
Consequently, even if you could generate URLs with UTF-16-based percent escapes, no other program would be able to understand them, making them useless. In fact, by matter of definition, they wouldn't even be URLs.
There's also the question of why on earth you would want to use UTF-16 for anything, it being silly and all.
Remember: Never Don't Use UTF-8! (N'DUUH!)
URL escapes, as in %nn hex values, encode bytes: 8-bit bytes. If for some very nonstandard reason you want to encode the bytes of UTF-16 instead of UTF-8, you must first pick a byte order (BE or LE). Then you have to write code to take the two bytes of each 16-bit UTF-16 code unit and represent each byte as %nn in hex.
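For the sake of argument, a sketch of what that would look like in Java:
import java.nio.charset.StandardCharsets;

public class Utf16PercentEscape {
    public static void main(String[] args) {
        // Pick big-endian UTF-16 without a BOM, then escape every byte as %nn
        byte[] bytes = "Aà".getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        // Prints %00%41%00%E0, which no RFC 3986 consumer will decode as text
        System.out.println(sb);
    }
}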