Java - Converting from Unicode to a string?

I can easily create a unicode character and print it with the following lines of code
String uniChar = Character.toString((char)0000);
System.out.println(uniChar);
However, now I want to retrieve the number above, add 3, and print the new Unicode character that the number 0003 corresponds to. Is there a way for me to retrieve the ACTUAL string of uniChar, as in "\u0000"? That way I could substring out just the "0000", convert it to an int, add 3, and reverse the entire process.

I think you're looking for String#codePointAt:
Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length() - 1.
If the char value specified at the given index is in the high-surrogate range, the following index is less than the length of this String, and the char value at the following index is in the low-surrogate range, then the supplementary code point corresponding to this surrogate pair is returned. Otherwise, the char value at the given index is returned.
For instance:
// String containing smiling face with smiling eyes emoji
String str = "😊";
// Get the code point
int cp = str.codePointAt(0);
// Show it
System.out.println(str + ", code point = U+" + toHex(cp));
// Increase it
++cp;
// Get the updated string (from an array of code points)
String updated = new String(new int[] { cp }, 0, 1);
// Show it
System.out.println(updated + ", code point = U+" + toHex(cp));
(toHex is just a helper that returns Integer.toString(n, 16).toUpperCase().)
That outputs:
😊, code point = U+1F60A
😋, code point = U+1F60B
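Applied back to the original question, the same approach works without ever going through the "\u0000" text form. A minimal sketch (using 0x0041, 'A', as an illustrative stand-in for the asker's (char)0000, since U+0000 is a non-printable control character):
String uniChar = Character.toString((char) 0x0041); // "A"
int cp = uniChar.codePointAt(0);                    // 0x41
cp += 3;                                            // 0x44
String shifted = new String(new int[] { cp }, 0, 1);
System.out.println(shifted);                        // "D"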

This code works in both cases: for code points from the Unicode BMP and for code points from the Unicode supplementary planes, which take 4 bytes in UTF-8. A supplementary code point needs 2 Java char values to store, so in that case string.length() == 2.
// array will contain one or two characters
char[] chars = Character.toChars(codePoint);
// string.length will be 1 or 2
String str = new String(chars);
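For example, a quick sketch contrasting a BMP code point with a supplementary one:
// BMP code point: one char
char[] bmp = Character.toChars(0x263A); // ☺
System.out.println(new String(bmp).length()); // prints 1
// Supplementary code point: a surrogate pair, two chars
char[] supp = Character.toChars(0x1F60A); // 😊
System.out.println(new String(supp).length()); // prints 2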

Unicode is a numbering of "characters" - code points - up to a 3-byte int range (0x0 to 0x10FFFF).
The UTF-16 encoding uses a sequence of byte pairs, and a Java char is such a byte pair. The (int) cast of a char is imperfect and covers only part of Unicode. The correct way to convert a code point to possibly more than one char:
int codePoint = 0x263B;
char[] chars = Character.toChars(codePoint);
To work with Unicode code points, one can do:
int[] codePoints = {0x2639, 0x263a, 0x263b};
String s = new String(codePoints, 0, codePoints.length);
codePoints[0] += 2;
You could use an int array of a single code point.
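For instance:
int[] single = { 0x263B };
String one = new String(single, 0, 1); // "☻"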
In Java 8 one can get an IntStream of code points:
s.codePoints().forEach(cp -> {
    System.out.printf("U+%X = %s%n", cp, Character.getName(cp));
});

Related

substring on a non-ASCII character in Java and Scala

I am not able to find a method in Java or Scala to do a substring on non-ASCII characters using the absolute length from getBytes.
val string = "achâth33Franklin"
string.length
Int = 16
string.getBytes.length
Int = 17
string.substring(0,7)
String = achâth3
I need a method that results in achâth, as the string contains a non-ASCII character whose byte length is 2:
val test = "â"
test.getBytes.length
res26: Int = 2
To give more perspective on the problem:
The length of the field is fixed at 7 and is normally all ASCII. Sometimes, the sender puts non-ASCII values in the string. When that happens, substring(0, 7) pulls the next field's values into the current one.
Explanation for @VGR:
scala> val string = "achâth33Franklin"
string: String = achâth33Franklin
scala> new String(string.getBytes,0,7)
res30: String = achâth
scala> string.substring(0,7)
res31: String = achâth3
One way to do that is to combine the getBytes() method with the String(byte[], int, int) constructor.
So your method would look like this:
String string = "achâth33Franklin";
string.substring(0,7); //achâth3
new String(string.getBytes(), 0, 7)); //achâth
That constructor takes an array of bytes, an offset into the array, and the number of bytes to use. So new String(string.getBytes(), 0, n) selects the first n bytes much as string.substring(0, n) selects the first n chars - per byte instead of per character.
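Note that cutting by bytes can land in the middle of a multi-byte sequence. A minimal sketch that at least makes the charset explicit (assuming UTF-8 rather than relying on the platform default):
import java.nio.charset.StandardCharsets;

String string = "achâth33Franklin";
byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
// Take the first 7 bytes; if the cut lands inside a multi-byte
// sequence, the decoder substitutes U+FFFD for the fragment.
String first7 = new String(bytes, 0, 7, StandardCharsets.UTF_8);
System.out.println(first7); // achâth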

Java: how to convert a Unicode emoji string to an Integer

I received a Unicode string containing an emoji code, for example "U+1F44F" (from this emoji table: http://apps.timwhitlock.info/emoji/tables/unicode).
I want to convert this string to an Integer. How can I do that?
I tried this, but it crashes:
int hex = Integer.parseInt(unicodeStr, 16);
Thanks guys!
The comment by @flakes gives the correct answer. The U+ only indicates that the following hex number is a Unicode code point. The value you want to convert into an Integer is the code point itself, so you have to omit the first 2 characters with .substring(2).
You will obtain the following code:
int hex = Integer.parseInt(unicodeStr.substring(2), 16);
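A complete round trip, from the "U+..." notation to a displayable string and back to an int, might look like this minimal sketch:
String unicodeStr = "U+1F44F";
int codePoint = Integer.parseInt(unicodeStr.substring(2), 16); // 128079
String emoji = new String(new int[] { codePoint }, 0, 1);
System.out.println(emoji); // 👏
System.out.println(emoji.codePointAt(0) == codePoint); // true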
Unicode numbers "characters" as code points, up to a 3-byte range, such as U+1F44F.
Java String has a constructor that takes code points:
int[] codepoints = { 0x1F44F };
String s = new String(codepoints, 0, codepoints.length);
public static String fromCodepoints(int... codepoints) {
    return new String(codepoints, 0, codepoints.length);
}
s = fromCodepoints(0x1F44F, 0x102);
A Java String holds Unicode as an internal array of chars, every char (2 bytes) being UTF-16 encoded. For the lower range a single char is itself a code point, and U+0102 could be written as "\u0102", containing the char '\u0102'.
Note that an emoji must be representable in the font:
Font font = ...
if (!font.canDisplay(0x1F44F)) {
    ...
}

Hash a String into a fixed-bit hash value

I want to hash a word into a fixed-bit hash value, say 64-bit or 32-bit (binary).
I used the following code:
long murmur_hash = MurmurHash.hash64(word);
Then the murmur_hash value is converted to binary by the following function:
public static String intToBinary(int n, int numOfBits) {
    String binary = "";
    for (int i = 0; i < numOfBits; ++i) {
        if (n % 2 == 0) {
            binary = "0" + binary;
        } else {
            binary = "1" + binary;
        }
        n /= 2; // divide after reading the bit, not before, or the result is shifted
    }
    return binary;
}
Is there a direct method to convert the hash into binary?
Just use this
Integer.toBinaryString(int i)
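Note that MurmurHash.hash64 returns a long, so for the 64-bit value the long variant applies. A small sketch (the hash value is an illustrative stand-in):
long murmurHash = 5L; // stand-in for MurmurHash.hash64(word)
System.out.println(Long.toBinaryString(murmurHash)); // prints 101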
If you want to convert it into a fixed-width binary string, that is, always get a 64-character string with zero padding, then you have a couple of options. If you have Apache Commons Lang's StringUtils, you can use:
StringUtils.leftPad( Long.toBinaryString(murmurHash), Long.SIZE, "0" );
If you don't, you can write a padding method yourself:
public static String paddedBinaryFromLong(long val) {
    StringBuilder sb = new StringBuilder(Long.toBinaryString(val));
    char[] zeros = new char[Long.SIZE - sb.length()];
    Arrays.fill(zeros, '0');
    sb.insert(0, zeros);
    return sb.toString();
}
This method starts by using the Long.toBinaryString(long) method, which conveniently does the bit conversion for you. The only thing it doesn't do is pad on the left if the value is shorter than 64 characters.
The next step is to create an array of 0 characters with the missing zeros needed to pad to the left.
Finally, we insert that array of zeros at the beginning of our StringBuilder, and we have a 64-character, zero-padded bit string.
Note: there is a difference between using Long.toBinaryString(long) and Long.toString(long, radix). The difference shows with negative numbers. With the first, you'll get the full two's-complement value of the number. With the second, you'll get the magnitude with a minus sign:
System.out.println(Long.toString(-15L,2));
result:
-1111
System.out.println(Long.toBinaryString(-15L));
result:
1111111111111111111111111111111111111111111111111111111111110001
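As a follow-up, the left padding can also be done without Apache Commons via String.format, which pads with spaces that are then replaced with zeros (safe here because a binary string never contains spaces). A small sketch:
long murmurHash = 5L; // illustrative stand-in
String padded = String.format("%64s", Long.toBinaryString(murmurHash)).replace(' ', '0');
System.out.println(padded); // 61 zeros followed by 101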
Another way is to use
Integer.toString(i, radix)
which gives the string representation of the first argument i in the radix (binary - 2, octal - 8, decimal - 10, hex - 16) specified by the second argument.
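For example:
System.out.println(Integer.toString(10, 2));  // prints 1010
System.out.println(Integer.toString(10, 8));  // prints 12
System.out.println(Integer.toString(10, 16)); // prints a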

Converting an int to a char and then back to an int - doesn't always give the same result

I am trying to get a char from an int value > 0xFFFF. But instead, I always get back the same char value that, when cast to an int, prints the value 65535 (0xFFFF).
I couldn't understand why it cannot produce symbols for Unicode values > 0xFFFF.
int hex = 0x10FFFF;
char c = (char)hex;
System.out.println((int)c);
I expected the output to be 0x10FFFF. Instead, the output comes back as 65535.
This is because, while an int is 4 bytes, a char is only 2 bytes. Thus, you can't represent all values in a char that you can in an int. Using a standard unsigned integer representation, you can only represent the range of values from 0 to 2^16 - 1 == 65535 in a 2-byte value, so if you convert any number outside that range to a 2-byte value and back, you'll lose data.
int is 4 bytes. char is 2 bytes.
Your number was well within the range an int can hold, but not within what a char can.
So when you converted that number to a char, data was lost: the narrowing cast keeps only the low 16 bits, which for 0x10FFFF happen to be 0xFFFF, i.e. the 65535 that was printed.
Your number was too big to fit in a char, which is 2 bytes, but small enough to fit in an int, which is 4 bytes. 65535 (0xFFFF) is the largest value a char can hold, which is why that is what you got. If a char were big enough to hold your number, casting back to an int would have returned the decimal value of 0x10FFFF, which is 1114111.
Unfortunately, I think you were expecting a Java char to be the same thing as a Unicode code point. They are not the same thing.
The Java char, as already expressed by other answers, can only support code points that can be represented in 16 bits, whereas Unicode needs 21 bits to support all code points.
In other words, a Java char on its own, only supports Basic Multilingual Plane characters (code points <= 0xFFFF). In Java, if you want to represent a Unicode code point that is in one of the extended planes (code points > 0xFFFF), then you need surrogate characters, or a pair of characters to do that. This is how UTF-16 works. And, internally, this is how Java strings work as well. Just for fun, run the following snippet to see how a single Unicode code point is actually represented by 2 characters if the code point is > 0xFFFF:
// Printing string length for a string with
// a single Unicode code point: 0x22BED.
System.out.println("𢯭".length()); // prints 2, because it uses a surrogate pair.
If you want to safely convert an int value that represents a Unicode code point to a char (or chars to be more exact), and then convert it back to an int code point, you will have to use code like this:
public static void main(String[] args) {
    int hex = 0x10FFFF;
    System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
    char[] surrogateChars = Character.toChars(hex);
    int codePointConvertedBack = Character.codePointAt(surrogateChars, 0);
    System.out.println(codePointConvertedBack); // prints 1114111
}
Alternatively, instead of manipulating char arrays, you can use a String, like this:
public static void main(String[] args) {
    int hex = 0x10FFFF;
    System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
    String s = new String(new int[] {hex}, 0, 1);
    int codePointConvertedBack = s.codePointAt(0);
    System.out.println(codePointConvertedBack); // prints 1114111
}
For further reading: Java Character Class

BigInteger to String according to ASCII

Is there a way to convert a series of integers to a String according to the ASCII table? I want to take the ASCII value of a String and convert it back to a String. For example:
97098097 => "aba"
I really need an effective way of taking an integer and converting it to a String according to its ASCII values. The method must also take into account that there is no zero in front of the leading '9': the String "aba" has the ASCII value 97098097 because 'a' has the ASCII value 097, while the String "dee" has 100101101. This means that not every number will have a digit count that is a multiple of three.
If you have any misunderstandings of what I'm trying to do please let me know.
No lookup table required.
while (string.length() % 3 != 0)
{
    string = '0' + string;
}
String result = "";
for (int i = 0; i < string.length(); i += 3)
{
    result += (char) Integer.parseInt(string.substring(i, i + 3));
}
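Wrapped into a method for clarity (a sketch; the method name decodeAscii is my own):
public static String decodeAscii(String digits) {
    while (digits.length() % 3 != 0) {
        digits = '0' + digits; // pad to a multiple of 3 digits
    }
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < digits.length(); i += 3) {
        result.append((char) Integer.parseInt(digits.substring(i, i + 3)));
    }
    return result.toString();
}
// decodeAscii("97098097") returns "aba"
// decodeAscii("100101101") returns "dee"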
First, I would create some sort of lookup table in your code with all the ASCII values and their String equivalents. Then take the big int and convert it to a String. Then take the length of your big-int string mod 3 to determine whether you need to add 1, 2, or no 0's to the front of it. Then just grab every 3 digits from the front of the number, compare them to the lookup table, and append the corresponding value to your result string.
Example:
Given 97098097
You would convert it to: "97098097"
Then you take the length mod 3: 8 mod 3 = 2, so 3 - 2 = 1 zero needs to be added.
Append 1 zero: "097098097"
Then grab every 3 digits from the front and compare them to the lookup table:
097 -> a, so result += "a"
098 -> b, so result += "b"
097 -> a, so result += "a"
You end with result being "aba".
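For the reverse direction, encoding a string into this numeric form could look like the sketch below (assuming each character maps to exactly three decimal digits and leading zeros are dropped only from the front of the whole number, matching the question's examples; encodeAscii is a hypothetical name):
public static String encodeAscii(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
        sb.append(String.format("%03d", (int) c)); // three digits per char
    }
    // Drop the leading zeros so "aba" becomes 97098097, not 097098097
    return sb.toString().replaceFirst("^0+", "");
}
// encodeAscii("aba") returns "97098097"
// encodeAscii("dee") returns "100101101"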
