I would like to represent a single Unicode character in Java. Which primitive or class is appropriate for this?
Note that I want to be able to store any Unicode character, which may be too large for a 2-byte char.
char is indeed 16-bit: a char corresponds to a UTF-16 code unit. Characters that don't fit in a single UTF-16 code unit (emoji, for instance) require two chars.
If you need to store them individually for some reason, you can use an int for that. It has sufficient room (and then some) for all of the 0x10FFFF code points currently allowed in Unicode. That's what the JDK uses, for instance in Character.codePointAt(CharSequence seq, int index) and String(int[] codePoints, int offset, int count).
Gratuitous conversion example (live on ideone):
String s = "😂";
int emoji = Character.codePointAt(s, 0);
String unumber = "U+" + Integer.toHexString(emoji).toUpperCase();
System.out.println(s + " is code point " + unumber);
String s2 = new String(new int[] { emoji }, 0, 1);
System.out.println("Code point " + unumber + " converted back to string: " + s2);
System.out.println("Successful round-trip? " + s.equals(s2));
which outputs:
😂 is code point U+1F602
Code point U+1F602 converted back to string: 😂
Successful round-trip? true
Depends on the definition of a character:
If you mean one single Unicode code point, use int, which can hold every value from U+0000 to U+10FFFF.
However, in some cases what appears as one character occupies multiple code points. This is especially common with emoji, e.g.:
skin tone: 👍🏻 👍🏿
country flags: 🇯🇵 🇺🇸
families: 👨‍👩‍👧‍👦, which becomes "👨+👩+👧+👦" if I replace the zero-width joiners (U+200D) with plus signs.
To store those, the most logical way is to use a String.
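A minimal sketch, using the family emoji above, shows how one visible character spans several code points and even more chars:
String family = "👨‍👩‍👧‍👦"; // 4 emoji code points joined by 3 zero-width joiners
System.out.println(family.length());             // 11 (UTF-16 code units)
System.out.println(family.codePoints().count()); // 7 (code points)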
code:
String st = "abc";
String sl = st.charAt(0) + st.charAt(st.length() - 1);
The second line is wrong for some reason, and I don't know why.
The book is wrong, and Eclipse is right.
In Java, you can write "abc" + whatever, or whatever + "abc", and it concatenates the strings -- because one side is a String.
But in st.charAt(0) + st.charAt(st.length() - 1), neither side is a String. They're both chars. So Java won't give you a String back.
Instead, Java will actually technically give you an int. Here are the gritty details from the Java Language Specification, which describes exactly how Java works:
JLS 4.2 specifies that char is considered a numeric type.
JLS 15.18.2 specifies what + does to values of numeric types.
In particular, it specifies that the first thing done to them is binary numeric promotion, which converts both chars to int by JLS 5.6.2. Then it adds them, and the result is still an int.
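A minimal sketch of that promotion in action:
char a = 'a', b = 'b';
int sum = a + b;         // binary numeric promotion: both chars become ints
System.out.println(sum); // prints 195 (97 + 98), not "ab"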
To get what you want to happen, probably the simplest solution is to write
String sl = st.charAt(0) + "" + st.charAt(st.length() - 1));
Because charAt(int) returns a char.
use this code:
String st = "abc";
StringBuilder str = new StringBuilder();
str.append(st.charAt(0));
str.append(st.charAt(st.length() - 1));
System.out.println(str.toString());
The append method accepts a char, a String, etc.
Well, this is what it says: "Type mismatch: cannot convert from int to String"
Meaning exactly what @Jaime said. If I remember correctly, a char is technically represented by an integer value (i.e. 'a' + 1 == 'b'). So you're adding two char values, which isn't the same thing as adding two strings together, or even concatenating two char values. One fix would be to use String.valueOf(st.charAt(0)) + String.valueOf(st.charAt(st.length() - 1)) to concatenate the two char values.
I have a requirement where I need to ensure that a hexadecimal digit string is presented in 8-digit format, even if the leading digits are zero. For example, the string corresponding to 0x3132 should be formatted as "0x00003132".
I tried this:
String key_ip = txt_key.getText();
int addhex = 0;
char [] ch = key_ip.toCharArray ();
StringBuilder builder = new StringBuilder();
for (char c : ch) {
int z = (int) c;
builder.append(Integer.toHexString(z).toUpperCase());
}
System.out.println("\ n (key) is:" + key_ip);
System.out.println("\ nkey in Hex:" + addhex + builder.toString());
, but it gave me an error. Can anyone explain how to fix or rewrite my code for this?
And I want to ask one more thing: if I use this code
Long.toHexString(blabla);
is it true that it changes the value "0x00" to "\0030", so that the output for 0 is 30?
Evidently, you are receiving a String, converting its chars to their Unicode code values, and forming a String containing the hexadecimal representations of those code values. The problem you want to solve is to left-pad the result with '0' characters so that the total length is not less than eight. In effect, the only parts of the example code that are directly related to the problem itself are
int addhex = 0;
and
System.out.println("\ nkey in Hex:" + addhex + builder.toString());
. Everything else is just setup.
It should be clear, however, that that particular attempt cannot work because, all other considerations aside, you need something that adapts to the un-padded length of the digit string, and that computation has no dependency on the length of the digit string at all.
Since you're already accumulating the digit string in a StringBuilder, it seems sensible to apply the needed changes to it, before reading out the result. There are several ways you could approach that, but a pretty simple one would be to just insert() zeroes one at a time until you reach the wanted length:
while (builder.length() < 8) {
builder.insert(0, '0'); // Inserts char '0' at position 0
}
I do suspect, however, that you may have interpreted the problem wrongly. The result you obtain from doing what you ask is ambiguous: in most cases where such padding is necessary, there are several input strings that could produce the same output. I am therefore inclined to guess that what is actually wanted is to pad the digits corresponding to each input character on a per-character basis, so that an input of "12" would yield the result "00310032". This would be motivated by the fact that Java char values are 16 bits wide, and it would produce a transformation that is reliably reversible. If that's what you really want, then you should be able to adapt the approach I've presented to achieve it (though in that case there are easier ways).
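If that per-character interpretation is the right one, here is a minimal sketch (reusing key_ip from the question) that pads each char to four hex digits:
StringBuilder perChar = new StringBuilder();
for (char c : key_ip.toCharArray()) {
    perChar.append(String.format("%04X", (int) c)); // zero-pad each UTF-16 code unit
}
System.out.println(perChar); // input "12" yields "00310032"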
if I use this code
Long.toHexString(blabla);
is it true that it changes the value "0x00" to "\0030", so that the output for 0 is 30?
The Unicode code value for the character '0', expressed in hexadecimal, is 30. Your method of conversion would produce that for the input string "0". Your method does not lend any special significance to the character '\' in its input.
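For instance:
System.out.println(Integer.toHexString('0')); // prints 30: the code value of '0' in hex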
I can easily create a Unicode character and print it with the following lines of code:
String uniChar = Character.toString((char)0000);
System.out.println(uniChar);
However, now I want to retrieve the number above, add 3, and print out the new Unicode character that the number 0003 corresponds to. Is there a way for me to retrieve the ACTUAL string of uniChar? As in "\u0000"? That way I could substring just the "0000", convert it to an int, add 3, and reverse the entire process.
I think you're looking for String#codePointAt:
Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length() - 1.
If the char value specified at the given index is in the high-surrogate range, the following index is less than the length of this String, and the char value at the following index is in the low-surrogate range, then the supplementary code point corresponding to this surrogate pair is returned. Otherwise, the char value at the given index is returned.
For instance (live copy):
// String containing smiling face with smiling eyes emoji
String str = "😊";
// Get the code point
int cp = str.codePointAt(0);
// Show it
System.out.println(str + ", code point = U+" + toHex(cp));
// Increase it
++cp;
// Get the updated string (from an array of code points)
String updated = new String(new int[] { cp }, 0, 1);
// Show it
System.out.println(updated + ", code point = U+" + toHex(cp));
(toHex is just return Integer.toString(n, 16).toUpperCase();)
That outputs:
😊, code point = U+1F60A
😋, code point = U+1F60B
This code will work in both cases: for code points from the Unicode BMP, and for those from the Unicode supplementary planes, which use 4 bytes in UTF-8 to encode a character. A 4-byte code point requires 2 Java char values to store, so in that case string.length() == 2.
// array will contain one or two characters
char[] chars = Character.toChars(codePoint);
// string.length will be 1 or 2
String str = new String(chars);
Unicode is a numbering of "characters" - code points - up to U+10FFFF, a range that fits in 3 bytes.
The UTF-16 encoding uses a sequence of byte pairs, and a Java char is such a byte pair. The (int) cast of a char is imperfect and covers only part of Unicode (the Basic Multilingual Plane). The correct way to convert a code point to possibly more than one char:
int codePoint = 0x263B;
char[] chars = Character.toChars(codePoint);
To work with Unicode code points, one can do:
int[] codePoints = {0x2639, 0x263a, 0x263b};
String s = new String(codePoints, 0, codePoints.length);
codePoints[0] += 2;
You could use an int array of 1 code point.
In Java 8 one can get an IntStream of code points:
s.codePoints().forEach(cp -> {
    System.out.printf("U+%X = %s%n", cp, Character.getName(cp));
});
I am trying to get a char from an int value > 0xFFFF. But instead, I always get back the same char value, that when cast to an int, prints the value 65535 (0xFFFF).
I don't understand how to generate symbols for Unicode values > 0xFFFF.
int hex = 0x10FFFF;
char c = (char)hex;
System.out.println((int)c);
I expected the output to be 0x10FFFF. Instead, the output comes back as 65535.
This is because, while an int is 4 bytes, a char is only 2 bytes. Thus, you can't represent all values in a char that you can in an int. Using a standard unsigned integer representation, you can only represent the range of values from 0 to 2^16 - 1 == 65535 in a 2-byte value, so if you convert any number outside that range to a 2-byte value and back, you'll lose data.
int is 4 bytes; char is 2 bytes.
Your number was well within the range an int can hold, but not the range a char can.
So when you converted that number to a char, the cast lost data: only the low 16 bits were kept, leaving 0xFFFF, the maximum a char can hold, which is what it printed, i.e. 65535.
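A quick sketch of that narrowing (the (char) cast keeps only the low 16 bits; see JLS 5.1.3):
int cp = 0x10041;                          // 0x10000 + 'A'
System.out.println((char) cp);             // prints A: only the low 16 bits (0x0041) survive
System.out.println((int) (char) 0x10FFFF); // prints 65535, because 0x10FFFF & 0xFFFF == 0xFFFF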
Your number was too big to fit in a char, which is 2 bytes, but small enough to fit in an int, which is 4 bytes. 65535 is the largest value that fits in a char, so that's the value you got. Also, if a char were big enough to hold your number, casting it back to an int would have returned the decimal value of 0x10FFFF, which is 1114111.
Unfortunately, I think you were expecting a Java char to be the same thing as a Unicode code point. They are not the same thing.
The Java char, as already expressed by other answers, can only support code points that can be represented in 16 bits, whereas Unicode needs 21 bits to support all code points.
In other words, a Java char on its own, only supports Basic Multilingual Plane characters (code points <= 0xFFFF). In Java, if you want to represent a Unicode code point that is in one of the extended planes (code points > 0xFFFF), then you need surrogate characters, or a pair of characters to do that. This is how UTF-16 works. And, internally, this is how Java strings work as well. Just for fun, run the following snippet to see how a single Unicode code point is actually represented by 2 characters if the code point is > 0xFFFF:
// Printing string length for a string with
// a single Unicode code point: 0x22BED.
System.out.println("\uD84A\uDFED".length()); // prints 2, because it uses a surrogate pair.
If you want to safely convert an int value that represents a Unicode code point to a char (or chars to be more exact), and then convert it back to an int code point, you will have to use code like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
char[] surrogateChars = Character.toChars(hex);
int codePointConvertedBack = Character.codePointAt(surrogateChars, 0);
System.out.println(codePointConvertedBack); // prints 1114111
}
Alternatively, instead of manipulating char arrays, you can use a String, like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
String s = new String(new int[] {hex}, 0, 1);
int codePointConvertedBack = s.codePointAt(0);
System.out.println(codePointConvertedBack); // prints 1114111
}
For further reading: Java Character Class
Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?
The array may be generated by code similar to:
byte[] utf8 = "Hello World".getBytes("UTF-8");
Alternatively it may have been generated by code similar to:
byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
messageContent[i] = (byte) i;
}
The key point is that we don't know what the array contains but need to find out in order to fill in the following function:
public final String getString(final byte[] dataToProcess) {
// Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
// If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
// If dataToProcess contains an encoded string then we will decode it and return.
}
How would this be extended to also cover UTF-16 or other encoding mechanisms?
It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data. But you can look for byte sequences that are invalid in UTF-8; if you find any, you know that it's not UTF-8.
If your array is large enough, this should work well, since invalid sequences are very likely to appear in "random" binary data such as compressed data or image files.
However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
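If you do want that closer analysis, here is a rough sketch of the same-script heuristic (the method name lettersShareOneScript is mine, and COMMON/INHERITED characters such as digits and punctuation are deliberately ignored):
static boolean lettersShareOneScript(String s) {
    Character.UnicodeScript seen = null;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (Character.isLetter(cp)) {
            Character.UnicodeScript sc = Character.UnicodeScript.of(cp);
            if (sc != Character.UnicodeScript.COMMON && sc != Character.UnicodeScript.INHERITED) {
                if (seen == null) seen = sc;       // first script seen
                else if (sc != seen) return false; // mixed scripts
            }
        }
        i += Character.charCount(cp); // advance by 1 or 2 chars
    }
    return true;
}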
Here's a way to use the UTF-8 "binary" regex from the W3C site
static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException
{
Pattern p = Pattern.compile("\\A(\n" +
" [\\x09\\x0A\\x0D\\x20-\\x7E] # ASCII\\n" +
"| [\\xC2-\\xDF][\\x80-\\xBF] # non-overlong 2-byte\n" +
"| \\xE0[\\xA0-\\xBF][\\x80-\\xBF] # excluding overlongs\n" +
"| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2} # straight 3-byte\n" +
"| \\xED[\\x80-\\x9F][\\x80-\\xBF] # excluding surrogates\n" +
"| \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2} # planes 1-3\n" +
"| [\\xF1-\\xF3][\\x80-\\xBF]{3} # planes 4-15\n" +
"| \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2} # plane 16\n" +
")*\\z", Pattern.COMMENTS);
String phonyString = new String(utf8, "ISO-8859-1");
return p.matcher(phonyString).matches();
}
As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.
As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.
And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.
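For instance, a minimal sketch of that stripping (the UTF-8 BOM is the byte sequence EF BB BF; the helper name stripUtf8Bom is mine):
static byte[] stripUtf8Bom(byte[] b) {
    if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                      && (b[1] & 0xFF) == 0xBB
                      && (b[2] & 0xFF) == 0xBF) {
        return java.util.Arrays.copyOfRange(b, 3, b.length); // drop the BOM
    }
    return b;
}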
UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.
A Java String is a sequence of 16 bit quantities that correspond to one of the (almost) 2**16 Unicode basic codepoints. But if you look at those 16 bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. The bit patterns don't have anything intrinsic about that says what they represent.
Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test if a number is "arbitrary"?)
The best we can do here is the following:
Test if the bytes are a valid UTF-8 encoding.
Test if the decoded 16-bit quantities are all legal, "assigned" Unicode code points. (Some 16 bit quantities are illegal (e.g. 0xffff) and others are not currently assigned to correspond to any character.) But what if a text document really uses an unassigned codepoint?
Test if the Unicode codepoints belong to the "planes" that you expect based on the assumed language of the document. But what if you don't know what language to expect, or if a document uses multiple languages?
Test if the sequences of codepoints look like words, sentences, or whatever. But what if we had some "binary data" that happened to include embedded text sequences?
In summary, you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about language, you can say that a byte sequence is probably or probably not a UTF-8 encoded text document.
IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.
In the original question, "How can I check whether a byte array contains a Unicode string in Java?", I found that the term "Java Unicode" essentially refers to UTF-16 code units. I went through this problem myself and created some code that could help anyone with this type of question on their mind find some answers.
I have created 2 main methods: one will display UTF-8 code units and the other will display UTF-16 code units. UTF-16 code units are what you will encounter with Java and JavaScript, commonly seen in the form "\ud83d".
For more help with code units and conversion, try the website:
https://r12a.github.io/apps/conversion/
Here is the code:
byte[] array_bytes = text.toString().getBytes(StandardCharsets.UTF_8); // be explicit: the no-arg getBytes() uses the platform default charset
char[] array_chars = text.toString().toCharArray();
System.out.println();
byteArrayToUtf8CodeUnits(array_bytes);
System.out.println();
charArrayToUtf16CodeUnits(array_chars);
public static void byteArrayToUtf8CodeUnits(byte[] byte_array)
{
System.out.println("array.length: = " + byte_array.length);
//------------------------------------------------------------------------------------------
for (int k = 0; k < byte_array.length; k++)
{
System.out.println("array byte: " + "[" + k + "]" + " converted to hex" + " = " + byteToHex(byte_array[k]));
}
//------------------------------------------------------------------------------------------
}
public static void charArrayToUtf16CodeUnits(char[] char_array)
{
/*Utf16 code units are also known as Java Unicode*/
System.out.println("array.length: = " + char_array.length);
//------------------------------------------------------------------------------------------
for (int i = 0; i < char_array.length; i++)
{
System.out.println("array char: " + "[" + i + "]" + " converted to hex" + " = " + charToHex(char_array[i]));
}
//------------------------------------------------------------------------------------------
}
static public String byteToHex(byte b)
{
//Returns hex String representation of byte b
char hexDigit[] =
{
'0', '1', '2', '3', '4', '5', '6', '7',
'8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
};
char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
return new String(array);
}
static public String charToHex(char c)
{
//Returns hex String representation of char c
byte hi = (byte) (c >>> 8);
byte lo = (byte) (c & 0xff);
return byteToHex(hi) + byteToHex(lo);
}
If the byte array begins with a Byte Order Mark (BOM) then it will be easy to distinguish what encoding has been used. The standard Java classes for processing text streams will probably deal with this for you automatically.
If you do not have a BOM in your byte data this will be substantially more difficult β .NET classes can perform statistical analysis to try and work out the encoding, but I think this is on the assumption that you know that you are dealing with text data (just don't know which encoding was used).
If you have any control over the format for your input data your best choice would be to ensure that it contains a Byte Order Mark.
Try decoding it with a strict decoder, i.e. one configured to report malformed input rather than silently replacing it. If you do not get any errors, then it is a valid UTF-8 string. (Note that new String(bytes, charset) never throws; it substitutes U+FFFD for malformed sequences, so you need a CharsetDecoder to detect errors.)
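For instance, a minimal sketch using a CharsetDecoder configured to report malformed input (the helper name isValidUtf8 is mine):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

static boolean isValidUtf8(byte[] bytes) {
    try {
        StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException e) {
        return false; // malformed byte sequence: not UTF-8
    }
}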
I think Michael has explained it well in his answer; this may be the only way to find out whether a byte array contains all valid UTF-8 sequences. I am using the following code in PHP:
function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
Taken from W3.org.