Java: how to convert unicode string Emoji to Integer - java

I received a Unicode string containing an emoji code, for example "U+1F44F" (from the emoji table at http://apps.timwhitlock.info/emoji/tables/unicode).
I want to convert this string to an Integer. How can I do that?
I tried this, but it crashes:
int hex = Integer.parseInt(unicodeStr, 16);
Thanks guys!

The comment by #flakes gives the correct answer. The U+ prefix only indicates that the hex number that follows is a Unicode code point. The value you want to convert into an Integer is that code point, so you have to drop the first two characters with .substring(2).
You will obtain the following code:
int hex = Integer.parseInt(unicodeStr.substring(2), 16);
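A slightly more defensive version of the same idea could look like the sketch below. The class name EmojiCodePoint and the method parse are made up for illustration. The sketch assumes the input always uses the "U+" prefix (in either case) and rejects anything else.

```java
final class EmojiCodePoint {
    // Hypothetical helper: turns "U+1F44F" (or "u+1f44f") into the int code point.
    static int parse(String notation) {
        String s = notation.trim();
        // Require the two-character "U+" prefix plus at least one hex digit.
        if (s.length() < 3 || !s.regionMatches(true, 0, "U+", 0, 2)) {
            throw new IllegalArgumentException("Expected U+XXXX, got: " + notation);
        }
        // Parse everything after the prefix as a base-16 number.
        int codePoint = Integer.parseInt(s.substring(2), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + notation);
        }
        return codePoint;
    }

    public static void main(String[] args) {
        int cp = parse("U+1F44F");
        System.out.println(cp); // prints 128079
        // Convert the code point back into a String (two chars for this emoji).
        System.out.println(new String(Character.toChars(cp)));
    }
}
```

The original call crashes because "U+" is not valid hex input, so Integer.parseInt throws a NumberFormatException before it reaches the digits.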

Unicode numbers "characters" as code points, in a range that fits in 3 bytes (up to U+10FFFF), such as U+1F44F.
Java String has a constructor that takes code points.
int[] codepoints = { 0x1F44F };
String s = new String(codepoints, 0, codepoints.length);
public static String fromCodepoints(int... codepoints) {
return new String(codepoints, 0, codepoints.length);
}
s = fromCodepoints(0x1F44F, 0x102);
A Java String holds Unicode as a sequence of chars, each char (2 bytes) being a UTF-16 code unit. For the lower ranges a single char can be a whole code point, so U+0102 can be written as "\u0102", containing the char '\u0102'.
Note that emoji must be representable in the font.
Font font = ...
if (!font.canDisplay(0x1F44F)) {
...
}
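To see the UTF-16 representation described above, here is a small sketch. The class SurrogateDemo and the helper utf16Units are hypothetical names. It prints the char values a String stores for a given code point.

```java
final class SurrogateDemo {
    // Hypothetical helper: lists the UTF-16 code units of one code point as hex.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String clap = new String(new int[] { 0x1F44F }, 0, 1);
        System.out.println(utf16Units(0x1F44F));                   // prints D83D DC4F
        System.out.println(utf16Units(0x102));                     // prints 0102
        System.out.println(clap.length());                         // prints 2 (chars)
        System.out.println(clap.codePointCount(0, clap.length())); // prints 1 (code point)
        System.out.println(clap.equals("\uD83D\uDC4F"));           // prints true
    }
}
```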

Related

How does encoding/decoding bytes work in Java?

Little background: I'm doing cryptopals challenges and I finished https://cryptopals.com/sets/1/challenges/1 but realized I didn't learn what I guess is meant to be learned (or coded).
I'm using the Apache Commons Codec library for Hex and Base64 encoding/decoding. The goal is to decode the hex string and re-encode it to Base64. The "hint" at the bottom of the page says "Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing."
Here's my answer...
private static Hex forHex = new Hex();
private static Base64 forBase64 = new Base64();
public static byte[] hexDecode(String hex) throws DecoderException {
byte[] rawBytes = forHex.decode(hex.getBytes());
return rawBytes;
}
public static byte[] encodeB64(byte[] bytes) {
byte[] base64Bytes = forBase64.encode(bytes);
return base64Bytes;
}
public static void main(String[] args) throws DecoderException {
String hex = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d";
//decode hex String to byte[]
byte[] myHexDecoded = hexDecode(hex);
String myHexDecodedString = new String(myHexDecoded);
//Lyrics from Queen's "Under Pressure"
System.out.println(myHexDecodedString);
//encode myHexDecoded to Base64 encoded byte[]
byte[] myHexEncoded = encodeB64(myHexDecoded);
String myB64String = new String(myHexEncoded);
//"pretty printing" of base64
System.out.println(myB64String);
}
...but I feel like I cheated. I didn't learn how to decode bytes that were encoded as hex, and I didn't learn how to encode "pure" bytes to Base64, I just learned how to use a library to do something for me.
If I were to take a String in Java then get its bytes, how would I encode those bytes into hex? For example, the following code snip turns "Hello" (which is readable English) to the byte value of each character:
String s = "Hello";
char[] sChar = s.toCharArray();
byte[] sByte = new byte[sChar.length];
for(int i = 0; i < sChar.length; i++) {
sByte[i] = (byte) sChar[i];
System.out.println("sByte[" + i + "] = " +sByte[i]);
}
which yields sByte[0] = 72, sByte[1] = 101, sByte[2] = 108, sByte[3] = 108, sByte[4] = 111
Let's use 'o' as an example - I am guessing its decimal version is 111 - do I just take its decimal version and change that to its hex version?
If so, to decode, do I just take the characters in the hex String 2 at a time, decompose them to decimal values, then convert to ASCII? Will it always be ASCII?
to decode, do I just take the characters in the hex String 2 at a time, decompose them to decimal values, then convert to ASCII? Will it always be ASCII?
No. You take the characters 2 at a time, transform the character '0' to the numeric value 0, the character '1' to the numeric value 1, ..., the character 'a' (or 'A', depending on which encoding you want to support) to the numeric value 10, ..., the character 'f' or 'F' to the numeric value 15.
Then you multiply the first numeric value by 16, and you add it to the second numeric value to get the unsigned integer value of your byte. Then you transform that unsigned integer value to a signed byte.
ASCII has nothing to do with this algorithm.
To see how it's done in practice, since commons-codec is open-source, you can just look at its implementation.
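The steps above can be sketched by hand as follows. This is not the commons-codec implementation, just a minimal illustration. The class name ManualHex and its methods are made up for the example. Note that Character.digit also accepts some non-ASCII digits, so a stricter version would check the character range first.

```java
final class ManualHex {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Two hex characters become one byte: first * 16 + second.
    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string needs an even number of characters");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);     // '0'..'f' -> 0..15, or -1
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit in pair " + i);
            }
            out[i] = (byte) (hi * 16 + lo); // unsigned 0..255 narrowed to a signed byte
        }
        return out;
    }

    // One byte becomes two hex characters: high nibble, then low nibble.
    static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            int unsigned = b & 0xFF; // undo the sign extension
            sb.append(DIGITS[unsigned >>> 4]).append(DIGITS[unsigned & 0x0F]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("6f")[0]);                               // prints 111
        System.out.println(encode(new byte[] { 72, 101, 108, 108, 111 })); // prints 48656c6c6f
    }
}
```

Neither direction involves a character set: the bytes are only interpreted as text if you later pass them to a String constructor.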

Java - Converting from unicode to a string?

I can easily create a unicode character and print it with the following lines of code
String uniChar = Character.toString((char)0000);
System.out.println(uniChar);
However, now I want to retrieve the number above, add 3, and print out the new unicode character that the numbers 0003 corresponds to. Is there a way for me to retrieve the ACTUAL string of unichar? As in "\u0000"? That way I could substring just the "0000", convert it to an int, add 3, and reverse the entire process.
I think you're looking for String#codePointAt:
Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length()- 1.
If the char value specified at the given index is in the high-surrogate range, the following index is less than the length of this String, and the char value at the following index is in the low-surrogate range, then the supplementary code point corresponding to this surrogate pair is returned. Otherwise, the char value at the given index is returned.
For instance:
// String containing smiling face with smiling eyes emoji
String str = "😊";
// Get the code point
int cp = str.codePointAt(0);
// Show it
System.out.println(str + ", code point = U+" + toHex(cp));
// Increase it
++cp;
// Get the updated string (from an array of code points)
String updated = new String(new int[] { cp }, 0, 1);
// Show it
System.out.println(updated + ", code point = U+" + toHex(cp));
(toHex is just return Integer.toString(n, 16).toUpperCase();)
That outputs:
😊, code point = U+1F60A
😋, code point = U+1F60B
This code works in both cases: for code points in the Unicode BMP and for code points in the supplementary planes, which take 4 bytes to encode in UTF-8. Such a code point needs 2 Java chars to be stored, so in that case string.length() == 2.
// array will contain one or two characters
char[] chars = Character.toChars(codePoint);
// string.length will be 1 or 2
String str = new String(chars);
Unicode is a numbering of "characters" - code points - in a range that fits in a 3-byte int (up to U+10FFFF).
The UTF-16 encoding uses a sequence of byte pairs, and a Java char is such a byte pair. Casting between char and int is imperfect, because a single char covers only part of Unicode. The correct way to convert a code point to possibly more than one char:
int codePoint = 0x263B;
char[] chars = Character.toChars(codePoint);
To work with Unicode code points, one can do:
int[] codePoints = {0x2639, 0x263a, 0x263b};
String s = new String(codePoints, 0, codePoints.length);
codePoints[0] += 2;
You could use an int array of 1 code point.
In Java 8 one can get an IntStream of code points:
s.codePoints().forEach(cp -> {
System.out.printf("U+%X = %s%n", cp, Character.getName(cp));
});
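Applied to the original question (take a character, add 3, print the resulting character), the approaches above might be combined as in this sketch. The class CodePointShift and the method shiftFirst are hypothetical names, and the sketch only looks at the first code point of the input.

```java
final class CodePointShift {
    // Hypothetical helper: returns the character `offset` code points after
    // the first code point of `s`, as a String of one or two chars.
    static String shiftFirst(String s, int offset) {
        int shifted = s.codePointAt(0) + offset;
        // toChars throws IllegalArgumentException if the result is not a valid code point.
        return new String(Character.toChars(shifted));
    }

    public static void main(String[] args) {
        System.out.println(shiftFirst("A", 3)); // prints D
        // Works for supplementary characters too: U+1F60A + 1 = U+1F60B.
        String next = shiftFirst("\uD83D\uDE0A", 1);
        System.out.println(Integer.toHexString(next.codePointAt(0)).toUpperCase()); // prints 1F60B
    }
}
```

There is no need to go through the "\u0000" source notation at all: the escape only exists in source code, while the running program sees the numeric value directly.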

Converting a int to char and then back to int - doesn't give same result always

I am trying to get a char from an int value > 0xFFFF. But instead, I always get back the same char value, that when cast to an int, prints the value 65535 (0xFFFF).
I couldn't understand why it is generating symbols for unicode > 0xFFFF.
int hex = 0x10FFFF;
char c = (char)hex;
System.out.println((int)c);
I expected the output to be 0x10FFFF. Instead, the output comes back as 65535.
This is because, while an int is 4 bytes, a char is only 2 bytes. Thus, you can't represent all values in a char that you can in an int. Using a standard unsigned integer representation, you can only represent the range of values from 0 to 2^16 - 1 == 65535 in a 2-byte value, so if you convert any number outside that range to a 2-byte value and back, you'll lose data.
An int is 4 bytes. A char is 2 bytes.
Your number was well within the range an int can hold, but not the range a char can hold.
So when you converted that number to a char, the high bits were dropped. Only the low 16 bits were kept, and for 0x10FFFF those are 0xFFFF, which is also the maximum a char can hold. That is what it printed, i.e. 65535.
Your number was too big to be a char, which is 2 bytes, but small enough to fit in an int, which is 4 bytes. The cast kept only the low 16 bits, which for 0x10FFFF are all ones, and 65535 is the biggest value that fits in a char, so that's why you got that value. Also, if a char were big enough to fit your number, converting it back to an int would have returned the decimal value of 0x10FFFF, which is 1114111.
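Strictly speaking, the narrowing cast does not clamp to the largest char value. It keeps the low 16 bits of the int. A short sketch makes the difference visible. The class NarrowingDemo and its method are made-up names for the example.

```java
final class NarrowingDemo {
    // Casts an int to char and widens it back to int.
    // The cast keeps only the low 16 bits; widening zero-extends them.
    static int roundTripThroughChar(int value) {
        return (char) value;
    }

    public static void main(String[] args) {
        System.out.println(roundTripThroughChar(0x10FFFF));                     // prints 65535
        System.out.println(Integer.toHexString(roundTripThroughChar(0x12345))); // prints 2345
        System.out.println(roundTripThroughChar(0x41));                         // prints 65
    }
}
```

With 0x10FFFF the low 16 bits happen to be all ones, which is why the result coincides with the maximum char value. With 0x12345 the result is 0x2345, not 0xFFFF.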
Unfortunately, I think you were expecting a Java char to be the same thing as a Unicode code point. They are not the same thing.
The Java char, as already expressed by other answers, can only support code points that can be represented in 16 bits, whereas Unicode needs 21 bits to support all code points.
In other words, a Java char on its own, only supports Basic Multilingual Plane characters (code points <= 0xFFFF). In Java, if you want to represent a Unicode code point that is in one of the extended planes (code points > 0xFFFF), then you need surrogate characters, or a pair of characters to do that. This is how UTF-16 works. And, internally, this is how Java strings work as well. Just for fun, run the following snippet to see how a single Unicode code point is actually represented by 2 characters if the code point is > 0xFFFF:
// Printing string length for a string with
// a single unicode code point: 0x22BED.
System.out.println("𹯭".length()); // prints 2, because it uses a surrogate pair.
If you want to safely convert an int value that represents a Unicode code point to a char (or chars to be more exact), and then convert it back to an int code point, you will have to use code like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
char[] surrogateChars = Character.toChars(hex);
int codePointConvertedBack = Character.codePointAt(surrogateChars, 0);
System.out.println(codePointConvertedBack); // prints 1114111
}
Alternatively, instead of manipulating char arrays, you can use a String, like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
String s = new String(new int[] {hex}, 0, 1);
int codePointConvertedBack = s.codePointAt(0);
System.out.println(codePointConvertedBack); // prints 1114111
}
For further reading: Java Character Class

Convert String with ascii value to normal string

Here is my code :
int availableBytes = inputStream.available();
if (availableBytes > 0) {
inputStream.read(readBuffer, 0, availableBytes);
System.out.println(new String(readBuffer, 0, availableBytes));
Reponse = new String(readBuffer, "UTF-8");
System.out.println(Reponse);
}
My question :
So my "Reponse" variable, of type String, seems to hold ASCII values, I think, because when I print "Reponse" it shows me 3 "squares with a question mark in them".
So is it possible to convert this String of ASCII values to integers?
Java String is not a sequence of bytes - signed 8-bit, but of chars - unsigned 16-bit. Also read the javadoc for your constructor:
Constructs a new String by decoding the specified subarray of bytes
using the platform's default charset. The length of the new String is
a function of the charset, and hence may not be equal to the length of
the subarray.
It does not work as you probably expect - it does not convert each byte into a character!
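If the goal is to see the numeric value of each received byte, or to decode only the bytes that were actually read, something along these lines may help. This is a sketch under assumptions: the class ReadAndDecode and its methods are hypothetical, the ByteArrayInputStream stands in for the real stream, and the data is assumed to be UTF-8 text. Note that the question's second constructor call decodes the whole buffer, including bytes that were never filled, which may be where the odd characters come from.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ReadAndDecode {
    // Converts the first `length` bytes to their unsigned values (0..255).
    static int[] unsignedValues(byte[] buffer, int length) {
        int[] values = new int[length];
        for (int i = 0; i < length; i++) {
            values[i] = buffer[i] & 0xFF;
        }
        return values;
    }

    // Decodes only the bytes that were read, with an explicit charset.
    static String decodeUtf8(byte[] buffer, int length) {
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real input stream: the bytes 'O', 'K', carriage return.
        InputStream in = new ByteArrayInputStream(new byte[] { 0x4F, 0x4B, 0x0D });
        byte[] readBuffer = new byte[64];
        int n = in.read(readBuffer, 0, readBuffer.length); // number of bytes actually read
        if (n > 0) {
            System.out.println(Arrays.toString(unsignedValues(readBuffer, n))); // prints [79, 75, 13]
            System.out.println(decodeUtf8(readBuffer, n).trim());               // prints OK
        }
    }
}
```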

Save a hex String to File.hex in java

I have a String which contains hex values. Now I want to write this exact string to a file with the ending .hex. How can I do this in Java?
I already tried to convert the hex values into ASCII and then write this string into a file.
But all hex values higher than 127 (dec) can't be processed correctly.
86 (hex) is transformed to ? (char), which is 3F (hex) and not 86 (hex).
You can try to take each char of your string, convert it to integer and then write values in bytes in a file. To do the opposite process, you just have to read the file into a byte array and convert each byte into a char to retrieve your string. Then I'm sure you can find some algorithm to cast your string into Hex string.
For me the answer was this:
Under Project Properties I needed to set the text file encoding to ISO-8859-1.
Then my old procedure worked very well.
public static String hexToASCII(String hex){
if(hex.length()%2 != 0){
System.err.println("requires EVEN number of chars");
return null;
}
StringBuilder sb = new StringBuilder();
for( int i=0; i < hex.length()-1; i+=2 ){
String output = hex.substring(i, (i + 2));
int decimal = Integer.parseInt(output, 16);
sb.append((char)decimal);
}
return sb.toString();
}
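An alternative that does not depend on any text encoding is to skip the String step and write raw bytes. The sketch below assumes that the goal is a file whose bytes are exactly the values named by the hex string. The class HexFileWriter and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class HexFileWriter {
    // Converts "86ff00" into the bytes 0x86, 0xFF, 0x00.
    static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("requires an even number of chars");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // parseInt on a two-character slice yields 0..255.
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Writes the bytes as-is; no charset is involved, so 0x86 stays 0x86.
    static void writeHex(String hex, Path target) throws IOException {
        Files.write(target, hexToBytes(hex));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".hex");
        writeHex("86ff00", file);
        byte[] back = Files.readAllBytes(file);
        System.out.println(back.length);    // prints 3
        System.out.println(back[0] & 0xFF); // prints 134, i.e. 0x86
        Files.delete(file);
    }
}
```

The '?' in the question appears when a char such as 0x86 is written through an encoder that has no mapping for it. Writing a byte array avoids the encoder entirely.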
