String Encoding with Emoji in Java?

I have a small test example like this:
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) {
        String s = "🇻🇺";
        System.out.println(s);
        System.out.println(s.length());
        System.out.println(s.toCharArray().length);
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);
        System.out.println(s.codePointCount(0, s.length()));
        System.out.println(Character.codePointCount(s, 0, s.length()));
    }
}
And the result is:
πŸ‡»πŸ‡Ί
4
4
8
10
2
2
I cannot understand why one Unicode character, the Vanuatu flag, returns a length of 4, takes 8 bytes in UTF-8, and 10 bytes in UTF-16. I know Java uses UTF-16 internally and needs one char (2 bytes) per code unit, but the 4 chars for one visible character confuse me: I thought it would need just 2 chars, yet the result is 4. Can someone explain this fully to help me understand? Many thanks.

Unicode flag emojis are encoded as two code points.
There are 26 Regional Indicator Symbols representing A-Z, and a flag is encoded by spelling out the ISO country code. For example, the Vanuatu flag is encoded as "VU", and the American flag is "US".
The indicators are all in a supplementary plane, so each of them requires two UTF-16 code units (a surrogate pair). That brings the total up to 4 Java chars per flag, and 8 bytes in UTF-8 (4 per indicator). The 10 bytes from getBytes(StandardCharsets.UTF_16) are those same 8 bytes plus the 2-byte byte order mark that Java's UTF-16 charset prepends when encoding.
The purpose of this is to avoid having to update the standard whenever a country gains or loses independence, and it helps the Unicode consortium stay neutral since it doesn't have to be an arbiter of geopolitical claims.
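You can see the decomposition yourself by listing the code points; a minimal sketch (the names in the comments come from the Unicode regional indicator block):
"🇻🇺".codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
// U+1F1FB REGIONAL INDICATOR SYMBOL LETTER V
// U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U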

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per Unicode character. The first byte carries from 3 to 7 bits of the character, and each subsequent byte carries 6 bits. Thus there's from 7 to 21 bits of payload.
The number of bytes needed depends on the particular character.
See this Wikipedia page for the encoding.
UTF-16 uses either one 16-bit unit or two 16-bit units for a Unicode character. Approximately speaking, characters in the first 64K characters are encoded as one unit; characters outside that range need two units.
"Approximately" because, actually, the codes that fit in one 16-bit unit are either in U+0000 to U+D7FF, or U+E000 to U+FFFF. The values in between those two are used for the two-unit format.
The number of 16-bit units needed depends on the particular character.
See this other Wikipedia page.
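To make the variable lengths concrete, here is a small sketch; the class name and sample characters are mine, chosen for illustration:
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    public static void main(String[] args) {
        for (String s : new String[] { "A", "Ω", "€", "𝄞" }) {
            // length() counts UTF-16 code units, not characters
            System.out.printf("%s UTF-8: %d bytes, UTF-16 code units: %d%n",
                    s, s.getBytes(StandardCharsets.UTF_8).length, s.length());
        }
        // A UTF-8: 1 bytes, UTF-16 code units: 1
        // Ω UTF-8: 2 bytes, UTF-16 code units: 1
        // € UTF-8: 3 bytes, UTF-16 code units: 1
        // 𝄞 UTF-8: 4 bytes, UTF-16 code units: 2
    }
}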

Related

Why can't characters above 128 be printed?

In my project I am trying to convert a binary number to an integer and the integer to a character. But above 128, only the '?' character prints. Please help me print characters up to 250. My code is:
class b
{
    public static void main(String[] args)
    {
        String dec1 = "11011001";
        System.out.println(dec1);
        int dec = Integer.parseInt(dec1, 2);
        System.out.println(dec);
        String str = new Character((char) dec).toString();
        System.out.println("decrypted number is " + str);
    }
}
Thank you.
Not all byte values have a printable character associated with them. In ASCII the range 0x00 - 0x1f consists entirely of unprintable controls such as DC1, Bell, Backspace, etc., and many Unicode code points are unprintable too; Unicode keeps the same first 32 characters reserved as non-printable.
Byte values above 127 (0x7f) have different meanings in different encodings, and there are many encodings. Historically, ASCII was the default encoding and there were many extensions to it. These days the standard is Unicode, which exists in several encoding forms, including UTF-8, UTF-16 (LE, BE, and with BOM) and UTF-32 (LE, BE, and with BOM). UTF-8 is common for interchange, especially over the net, and UTF-16 is used internally in many systems.
Depending on the encoding and the glyph (the displayed representation), it may take from one to over 16 bytes to represent a single glyph. Emoji are mostly in Plane 1, meaning their code points do not fit in 16 bits (Unicode code points need up to 21 bits). Additionally, some glyphs are represented by a sequence of code points; examples are flags, which combine two regional indicator symbols, and emoji joined with zero-width "joiners".
In the case of 217 (0xD9), that single byte is not legal UTF-8 (it is a leading byte that requires a continuation byte after it), but 217 as a code point (U+00D9) is a valid representation of Ù.
See ASCII and Unicode.
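A short sketch of that difference; the charsets here are just for illustration:
import java.nio.charset.StandardCharsets;

byte[] b = { (byte) 217 };
System.out.println(new String(b, StandardCharsets.ISO_8859_1)); // Ù: 0xD9 maps directly to U+00D9
System.out.println(new String(b, StandardCharsets.UTF_8));      // �: a lone 0xD9 is malformed UTF-8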
As per your code: first the binary string is converted to an integer, and then you convert the integer to a character, which is done by looking up the character with the same code value as the integer. Since the basic ASCII table only defines values up to 127, you get a proper character only up to the integer value 127; for values of dec1 greater than 127 you get ? as the character, which is then converted into a String. The first 32 values are non-printable control characters, so you will get strange symbols for those, but for values of dec1 in the range 32-126 you get the character assigned to that particular value in the ASCII table. Since value 127 is assigned to DEL, you also get a strange symbol for 127.
The issue is that your console's encoding doesn't match the encoding of the output of your Java program. I don't know what console you're using, but on Windows, you can run this command to see your current encoding:
chcp
The default console's encoding for USA is 437 and for Western Europe and Canada 850. These encodings have the 128 characters from ASCII encoding and 128 additional characters that are different from one encoding to another. You get nothing beyond the 128 ASCII characters because your Java output's encoding doesn't match the console's encoding. You have to change one of them to match the other.
You can change your console's encoding to UTF-8 by running this command:
chcp 65001
If you're not on Windows, you'll have to search for the equivalent commands for your system. But I believe on most Linux & Unix derived systems, you can use the locale command to see the current encoding and the export command to change it.
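You can also set the encoding from the Java side by wrapping System.out in a PrintStream with an explicit charset; a minimal sketch, assuming the console itself expects UTF-8:
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8); // this constructor needs Java 10+
out.println((char) 217); // Ù, provided the console really is UTF-8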
I receive the following output from your code. I assume that you run the program in an environment/console that doesn't support the character. You need a console that supports UTF-8, UTF-16 or similar to be able to print all the characters you assign numerical values to.
11011001
217
decrypted number is Γ™

Unicode character length in bytes - always the same?

I defined a unicode character as a byte array:
private static final byte[] UNICODE_MEXT_LINE = Charsets.UTF_8.encode("\u0085").array();
At the moment byte array length is 3, is it safe to assume the length of the array is always 3 across platforms?
Thank you
It's safe to assume that that particular character will always encode to the same UTF-8 bytes, regardless of platform; UTF-8 is fully specified. (Strictly speaking, U+0085 encodes to two bytes, 0xC2 0x85. The 3 you observe comes from ByteBuffer.array(), which returns the encoder's whole backing array rather than just the encoded content; ByteBuffer.remaining() gives the actual byte count.)
But Unicode characters in UTF-8 can be one, two, three or four bytes long, so no, you can't assume that converting any character to UTF-8 will come out as three bytes.
That particular character will always encode to the same number of bytes, but others will be different. Unicode characters are anywhere from 1 to 4 bytes long in UTF-8. The 8 in 'UTF-8' just means that it uses 8-bit code units.
The Wikipedia page on UTF-8 provides a pretty good overview of how that works. Basically, the first bits of the first byte tell you how many bytes long that character will be. For instance, if the first bit of the first byte is a 0 as in 01111111, then that means this character is only one byte long (in utf-8, these are the ascii characters). If the first bits are 110 as in 11011111, then that tells you that this character will be two bytes long. The chart in the Wikipedia page provides a good illustration of this.
There's also this question, which has some good answers as well.
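A small sketch that makes both points concrete; note that String.getBytes avoids the backing-array pitfall entirely:
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

System.out.println("\u0085".getBytes(StandardCharsets.UTF_8).length); // 2: the real encoded size
ByteBuffer bb = StandardCharsets.UTF_8.encode("\u0085");
System.out.println(bb.remaining());    // 2: bytes actually written
System.out.println(bb.array().length); // can be larger: backing-array capacity, not content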

What is a "surrogate pair" in Java?

I was reading the documentation for StringBuffer, in particular the reverse() method. That documentation mentions something about surrogate pairs. What is a surrogate pair in this context? And what are low and high surrogates?
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units are in two ranges known as "high surrogates" and "low surrogates", depending on whether they are allowed at the start or end of the two-code-unit sequence.
Early Java versions represented Unicode characters using the 16-bit char data type. This design made sense at the time, because all Unicode characters had values less than 65,535 (0xFFFF) and could be represented in 16 bits. Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the Unicode characters in Unicode version 3.1, 32-bit values β€” called code points β€” were adopted for the UTF-32 encoding scheme.
But 16-bit values are preferred over 32-bit values for efficient memory use, so Unicode introduced a new design to allow for the continued use of 16-bit values. This design, adopted in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates (in the range U+D800 to U+DBFF) and another 1,024 values to 16-bit low surrogates (in the range U+DC00 to U+DFFF). It uses a high surrogate followed by a low surrogate (a surrogate pair) to represent 1,024 × 1,024 = 1,048,576 (0x100000) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF).
Adding some more info to the above answers from this post.
Tested in Java-12, should work in all Java versions above 5.
As mentioned here (https://stackoverflow.com/a/47505451/2987755), any character whose code point is above U+FFFF is represented as a surrogate pair, which Java stores as a pair of char values; that is, the single Unicode character is represented as two adjacent Java chars.
As we can see in the following example.
1. Length:
"πŸŒ‰".length() //2, Expectations was it should return 1
"πŸŒ‰".codePointCount(0,"πŸŒ‰".length()) //1, To get the number of Unicode characters in a Java String
2. Equality:
Represent "πŸŒ‰" to String using Unicode \ud83c\udf09 as below and check equality.
"πŸŒ‰".equals("\ud83c\udf09") // true
Java string literals do not support UTF-32-style escapes ("\u1F309" parses as "\u1F30" followed by '9'):
"πŸŒ‰".equals("\u1F309") // false
3. You can convert a Unicode code point to a Java String
"πŸŒ‰".equals(new String(Character.toChars(0x0001F309))) //true
4. String.substring() does not consider supplementary characters
"πŸŒ‰πŸŒ".substring(0,1) //"?"
"πŸŒ‰πŸŒ".substring(0,2) //"πŸŒ‰"
"πŸŒ‰πŸŒ".substring(0,4) //"πŸŒ‰πŸŒ"
To solve this we can use String.offsetByCodePoints(int index, int codePointOffset)
"πŸŒ‰πŸŒ".substring(0,"πŸŒ‰πŸŒ".offsetByCodePoints(0,1) // "πŸŒ‰"
"πŸŒ‰πŸŒ".substring(2,"πŸŒ‰πŸŒ".offsetByCodePoints(1,2)) // "🌐"
5. Iterating a Unicode string with BreakIterator (see the sketch after this list)
6. Sorting Strings with Unicode java.text.Collator
7. Character.toUpperCase(char) and Character.toLowerCase(char) should not be used; instead, use String.toUpperCase and String.toLowerCase with the particular locale.
8. Character.isLetter(char ch) does not support supplementary characters; better to use Character.isLetter(int codePoint). For each methodName(char ch) method in the Character class there is a methodName(int codePoint) variant that can handle supplementary characters.
9. Specify the charset in String.getBytes() and when converting from bytes to String, and in InputStreamReader and OutputStreamWriter.
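A minimal sketch of point 5; on recent JDKs, BreakIterator's character instance keeps surrogate pairs (and many combining sequences) together:
import java.text.BreakIterator;

String s = "🌉🌐";
BreakIterator it = BreakIterator.getCharacterInstance();
it.setText(s);
for (int start = it.first(), end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
    System.out.println(s.substring(start, end)); // prints 🌉 then 🌐, one user-visible character per line
}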
Ref:
https://coolsymbol.com/emojis/emoji-for-copy-and-paste.html#objects
https://www.online-toolz.com/tools/text-unicode-entities-convertor.php
https://www.ibm.com/developerworks/library/j-unicode/index.html
https://www.oracle.com/technetwork/articles/javaee/supplementary-142654.html
Other terms worth exploring: Normalization, BiDi.
What that documentation is saying is that invalid UTF-16 strings may become valid after calling the reverse method since they might be the reverses of valid strings. A surrogate pair (discussed here) is a pair of 16-bit values in UTF-16 that encode a single Unicode code point; the low and high surrogates are the two halves of that encoding.
Small preface
Unicode represents code points. Each code point can be encoded in 8-, 16- or 32-bit blocks according to the Unicode standard.
Prior to Version 3.1, the encodings mostly in use were the 8-bit encoding, known as UTF-8, and the 16-bit encoding, known as UCS-2 or "Universal Character Set coded in 2 octets". UTF-8 encodes Unicode points as a sequence of 1-byte blocks, while UCS-2 always takes 2 bytes:
A = 41 - one block of 8-bits with UTF-8
A = 0041 - one block of 16-bits with UCS-2
Ξ© = CE A9 - two blocks of 8-bits with UTF-8
Ξ© = 03A9 - one block of 16-bits with UCS-2
Problem
The consortium thought that 16 bits would be enough to cover any human-readable language, which gives 2^16 = 65,536 possible code values. This was true for Plane 0, also known as the BMP or Basic Multilingual Plane, which includes 55,445 of those 65,536 code points today. The BMP covers almost every human language in the world, including Chinese-Japanese-Korean (CJK) symbols.
Time passed, new Asian character sets were added, and Chinese symbols alone took more than 70,000 points. Now there are even emoji points as part of the standard 😺. Sixteen new "additional" planes were added. The UCS-2 space was not enough to cover anything bigger than Plane 0.
Unicode decision
Limit Unicode to 17 planes × 65,536 characters per plane = 1,114,112 maximum points.
Present UTF-32, formerly known as UCS-4, to hold 32 bits for each code point and cover all planes.
Continue to use UTF-8 as dynamic encoding, limit UTF-8 to 4 bytes maximum for each code point, i.e. from 1 up to 4 bytes per point.
Deprecate UCS-2
Create UTF-16 based on UCS-2. Make UTF-16 dynamic, so it takes 2 bytes or 4 bytes per point. Assign 1024 points U+D800–U+DBFF, called High Surrogates, to UTF-16; assign 1024 symbols U+DC00–U+DFFF, called Low Surrogates, to UTF-16.
With those changes, the BMP is covered with one 16-bit block in UTF-16, while all "supplementary characters" are covered with surrogate pairs of two 16-bit blocks each, for a total of 1,024 × 1,024 = 1,048,576 points.
A high surrogate precedes a low surrogate; any deviation from this rule is considered bad encoding. For example, a surrogate without a pair is incorrect, and a low surrogate standing before a high surrogate is incorrect.
π„ž, 'MUSICAL SYMBOL G CLEF', is encoded in UTF-16 as a pair of surrogates 0xD834 0xDD1E (2 by 2 bytes),
in UTF-8 as 0xF0 0x9D 0x84 0x9E (4 by 1 byte),
in UTF-32 as 0x0001D11E (1 by 4 bytes).
Current situation
Although according to the standard the surrogates are specifically assigned only to UTF-16, historically some Windows and Java applications used UTF-8 and UCS-2 values that are now reserved for the surrogate range.
To support legacy applications with incorrect UTF-8/UTF-16 encodings, an extension called WTF-8, Wobbly Transformation Format, was created. It supports arbitrary surrogate points, such as a non-paired surrogate or an incorrect sequence. Today, some products do not comply with the standard and treat UTF-8 as WTF-8.
The surrogate solution opened some security problems, as well as attempts to use "illegal surrogate pairs".
Many historical details were omitted to stay on topic ⚖.
The latest Unicode Standard can be found at http://www.unicode.org/versions/latest
Surrogate pairs refer to UTF-16's way of encoding certain characters, see http://en.wikipedia.org/wiki/UTF-16/UCS-2#Code_points_U.2B10000..U.2B10FFFF
A surrogate pair is two 'code units' in UTF-16 that make up one 'code point'. The Java documentation is stating that these 'code points' will still be valid, with their 'code units' ordered correctly, after the reverse. It further states that two unpaired surrogate code units may be reversed and form a valid surrogate pair. Which means that if there are unpaired code units, then there is a chance that the reverse of the reverse may not be the same!
Notice, though, that the documentation says nothing about graphemes, which are multiple code points combined. That means an e and the combining accent that follows it may still be switched, placing the accent before the e; and if there is another vowel before the e, that vowel may end up with the accent that was on the e.
Yikes!
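A quick sketch of the guarantee; StringBuilder documents the same reverse() behavior as StringBuffer:
String s = "a🌉b";
System.out.println(new StringBuilder(s).reverse()); // "b🌉a": the surrogate pair stays in order, not split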

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16?
Why do we need these?
MessageDigest md = MessageDigest.getInstance("SHA-256");
String text = "This is some text";
md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
byte[] digest = md.digest();
I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 a character's length starts at 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.
Main UTF-8 cons:
Many common characters have different lengths, which makes indexing by code point and calculating a code-point count terribly slow.
Even though byte order doesn't matter, sometimes UTF-8 still has BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add BOM to UTF-8.
Main UTF-16 pros:
BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some codepoints outside BMP mandatory), most Japanese can be represented with 2 bytes. This speeds up indexing and calculating codepoint count in case the text does not contain supplementary characters.
Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows the use of a 16-bit char as the primitive component of the string.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
Using it as a fixed-length encoding β€œmostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means the programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
It's variable length, so counting or indexing codepoints is costly, though less than UTF-8.
In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.
They're simply different schemes for representing Unicode characters.
Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.
UTF-8 uses between 1 and 3 bytes for characters in the BMP, and up to 4 for characters in the current Unicode range of U+0000 to U+10FFFF; the 4-byte form structurally reaches U+1FFFFF, and the original design was extensible up to U+7FFFFFFF with 5- and 6-byte sequences, though those are no longer permitted. Notably, all ASCII characters are represented in a single byte each.
For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
See this page for more about UTF-8 and Unicode.
(Note that all Java chars are UTF-16 code units, which cover the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)
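A small sketch of that point: the same text digested under the two encodings produces different hashes (the text value is just illustrative):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

MessageDigest md = MessageDigest.getInstance("SHA-256");
byte[] d8 = md.digest("This is some text".getBytes(StandardCharsets.UTF_8));
byte[] d16 = md.digest("This is some text".getBytes(StandardCharsets.UTF_16));
System.out.println(Arrays.equals(d8, d16)); // false: different bytes in, different digest out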
Security: Use only UTF-8
Difference between UTF-8 and UTF-16? Why do we need these?
There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.
CVE-2008-2938
CVE-2012-2135
WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.
The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
Other groups are saying the same.
So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.
This is unrelated to UTF-8/16 in general (although it does convert to UTF-16, and the BE/LE part can be set with a single line), yet below is the fastest way to convert a String to byte[]. For instance, it is good for exactly the case provided (a hash code). String.getBytes(enc) is relatively slow.
import java.nio.ByteBuffer;

static byte[] toBytes(String s) {
    byte[] b = new byte[s.length() * 2];      // two bytes per Java char
    ByteBuffer.wrap(b).asCharBuffer().put(s); // ByteBuffer's default order is big-endian, so this is UTF-16BE without a BOM
    return b;
}
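For completeness, a matching decode sketch under the same big-endian assumption (the method name is mine):
static String fromBytes(byte[] b) {
    return ByteBuffer.wrap(b).asCharBuffer().toString(); // reads the bytes back as UTF-16BE chars
}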
A simple way to differentiate UTF-8 and UTF-16 is to identify the commonalities between them.
Other than sharing the same Unicode number for a given character, each one is its own format.
UTF-8 tries to represent each Unicode number with one byte (if it is in the ASCII range), otherwise with two, three or four bytes.
UTF-16 starts by representing each Unicode number with two bytes. If two bytes are not sufficient, it uses four bytes; UTF-16 never needs more than four.
Theoretically, UTF-16 is more space-efficient, but in practice UTF-8 is, as most of the characters being processed (98% of data) are ASCII, which UTF-8 represents with a single byte and UTF-16 with two bytes.
Also, UTF-8 is a superset of ASCII encoding, so ASCII data is also valid UTF-8 and is accepted unchanged by a UTF-8 processor. This is not true for UTF-16: UTF-16 cannot pass as ASCII, and this is a big hurdle for UTF-16 adoption.
Another point to note: all of Unicode as of now fits in a maximum of 4 bytes of UTF-8 (considering all languages of the world). This is the same as UTF-16's maximum, so there is no real saving in space compared to UTF-8 (https://stackoverflow.com/a/8505038/3343801).
So people use UTF-8 wherever possible.

Java Unicode encoding

A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters. Does this mean that you can't handle certain Unicode characters in a Java application?
Does this boil down to what character encoding you are using?
You can handle them all if you're careful enough.
Java's char is a UTF-16 code unit. For characters with code-point > 0xFFFF it will be encoded with 2 chars (a surrogate pair).
See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.
(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)
Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().
And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
To add to the other answers, some points to remember:
A Java char takes always 16 bits.
A Unicode character, when encoded as UTF-16, "almost always" (not always) takes 16 bits; that's because there are more than 64K Unicode characters. Hence, a Java char is NOT a Unicode character (though it "almost always" is).
"Almost always", above, means the 64K first code points of Unicode, range 0x0000 to 0xFFFF (BMP), which take 16 bits in the UTF-16 encoding.
A non-BMP ("rare") Unicode character is represented as two Java chars (surrogate representation). This applies also to the literal representation as a string: For example, the character U+20000 is written as "\uD840\uDC00".
Corollary: string.length() returns the number of Java chars, not of Unicode characters. A string that has just one "rare" Unicode character (e.g. U+20000) would return length() = 2. The same consideration applies to any method that deals with char sequences.
Java has little built-in intelligence for dealing with non-BMP Unicode characters as a whole. There are some utility methods that treat characters as code points, represented as ints, e.g. Character.isLetter(int codePoint). Those are the real fully-Unicode methods; see the sketch below.
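A minimal sketch of the difference, using U+20000 (a CJK ideograph outside the BMP):
String s = "\uD840\uDC00"; // U+20000 as a surrogate pair
System.out.println(s.length());                            // 2 Java chars
System.out.println(s.codePointCount(0, s.length()));       // 1 Unicode character
System.out.println(Character.isLetter(s.charAt(0)));       // false: a lone high surrogate
System.out.println(Character.isLetter(s.codePointAt(0)));  // true: the full code point is a letter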
You said:
A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters.
Unicode grows
Actually, the inventory of characters defined in Unicode has grown dramatically. Unicode continues to grow β€” and not just because of emojis.
143,859 characters in Unicode 13 (Java 15, release notes)
137,994 characters in Unicode 12.1 (Java 13 & 14)
136,755 characters in Unicode 10 (Java 11 & 12)
120,737 characters in Unicode 8 (Java 9)
110,182 characters in Unicode 6.2 (Java 8)
109,449 characters in Unicode 6.0 (Java 7)
96,447 characters in Unicode 4.0 (Java 5 & 6)
49,259 characters in Unicode 3.0 (Java 1.4)
38,952 characters in Unicode 2.1 (Java 1.1.7)
38,950 characters in Unicode 2.0 (Java 1.1)
34,233 characters in Unicode 1.1.5 (Java 1.0)
char is legacy
The char type is long outmoded, now legacy.
Use code point numbers
Instead, you should be working with code point numbers.
You asked:
Does this mean that you can't handle certain Unicode characters in a Java application?
The char type can address less than half of today's Unicode characters.
To represent any Unicode character, use code point numbers. Never use char.
Every character in Unicode is assigned a code point number. These range over a million values, from 0 to 1,114,111. Doing the math against the numbers listed above, most of the numbers in that range have not been assigned to a character yet. Some of those numbers are reserved as Private Use Areas and will never be assigned.
The String class has gained methods for working with code point numbers, as did the Character class.
Get the code point number for any character in a string, by zero-based index number. Here we get 97 for the letter a.
int codePoint = "Cat".codePointAt( 1 ) ; // 97 = 'a', hex U+0061, LATIN SMALL LETTER A.
For the more general CharSequence rather than String, use Character.codePointAt.
We can get the Unicode name for a code point number.
String name = Character.getName( 97 ) ; // letter `a`
LATIN SMALL LETTER A
We can get a stream of the code point numbers of all the characters in a string.
IntStream codePointsStream = "Cat".codePoints() ;
We can turn that into a List of Integer objects. See How do I convert a Java 8 IntStream to a List?.
List< Integer > codePointsList = codePointsStream.boxed().collect( Collectors.toList() ) ;
Any code point number can be changed into a String of a single character by calling Character.toString.
String s = Character.toString( 97 ) ; // 97 is `a`, LATIN SMALL LETTER A.
a
We can produce a String object from an IntStream of code point numbers. See Make a string from an IntStream of code point numbers?.
IntStream intStream = IntStream.of( 67 , 97 , 116 , 32 , 128_008 ); // 32 = SPACE, 128,008 = CAT (emoji).
String output =
intStream
.collect( // Collect the results of processing each code point.
StringBuilder :: new , // Supplier<R> supplier
StringBuilder :: appendCodePoint , // ObjIntConsumer<R> accumulator
StringBuilder :: append // BiConsumer<R,​R> combiner
) // Returns a `CharSequence` object.
.toString(); // If you would rather have a `String` than `CharSequence`, call `toString`.
Cat 🐈
You asked:
Does this boil down to what character encoding you are using?
Internally, a String in Java is always using UTF-16.
You only use other character encoding when importing or exporting text in or out of Java strings.
So, to answer your question, no, character encoding is not directly related here. Once you get your text into a Java String, it is in UTF-16 encoding and can therefore contain any Unicode character. Of course, to see that character, you must be using a font with a glyph defined for that particular character.
When exporting text from Java strings, if you specify a legacy character encoding that cannot represent some of the Unicode characters used in your text, you will have a problem. So use a modern character encoding, which nowadays means UTF-8 as UTF-16 is now considered harmful.
Have a look at the Unicode 4.0 support in J2SE 1.5 article to learn more about the tricks invented by Sun to provide support for all Unicode 4.0 code points.
In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:
char is a UTF-16 code unit, not a code point
new low-level APIs use an int to represent a Unicode code point
high level APIs have been updated to understand surrogate pairs
a preference towards char sequence APIs instead of char based methods
Since Java doesn't have 32 bit chars, I'll let you judge if we can call this good Unicode support.
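A brief sketch of those API levels, reusing the G clef character from earlier:
int cp = "𝄞".codePointAt(0);                 // 0x1D11E: codePointAt reads through the surrogate pair
char[] units = Character.toChars(cp);         // {'\uD834', '\uDD1E'}: back to UTF-16 code units
System.out.println(Character.charCount(cp));  // 2: this code point needs two chars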
Here's Oracle's documentation on Unicode Character Representations. Or, if you prefer, a more thorough documentation here.
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.

The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
From the OpenJDK 7 documentation for String:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
