Getting true UTF-8 characters in Java JNI

Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?
Unfortunately GetStringUTFChars() almost does what's required, but not quite: it returns a "modified" UTF-8 byte sequence. The main difference is that modified UTF-8 doesn't contain any embedded null bytes (so you can treat it as an ANSI C null-terminated string), but another difference seems to be how Unicode supplementary characters such as emoji are treated.
A character such as U+1F604 "SMILING FACE WITH OPEN MOUTH AND SMILING EYES" is stored as a surrogate pair (two UTF-16 code units, 0xD83D 0xDE04) and has a 4-byte UTF-8 equivalent of F0 9F 98 84, and that is the byte sequence I get if I convert the string to UTF-8 in Java:
char[] c = Character.toChars(0x1F604);
String s = new String(c);
System.out.println(s);
for (int i = 0; i < c.length; ++i)
    System.out.println("c[" + i + "] = 0x" + Integer.toHexString(c[i]));
byte[] b = s.getBytes("UTF-8");
for (int i = 0; i < b.length; ++i)
    System.out.println("b[" + i + "] = 0x" + Integer.toHexString(b[i] & 0xFF));
The code above prints the following:
😄
c[0] = 0xd83d
c[1] = 0xde04
b[0] = 0xf0
b[1] = 0x9f
b[2] = 0x98
b[3] = 0x84
However, if I pass 's' into a native JNI method and call GetStringUTFChars() I get 6 bytes. Each of the surrogate pair characters is being converted to a 3-byte sequence independently:
JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
{
    const char* sBytes = env->GetStringUTFChars(_s, NULL);
    for (int i = 0; sBytes[i] != 0; ++i)
        fprintf(stderr, "%d: %02x\n", i, (unsigned char) sBytes[i]);
    env->ReleaseStringUTFChars(_s, sBytes);
}
0: ed
1: a0
2: bd
3: ed
4: b8
5: 84
The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8. That in turn causes my native Mac code to crash because it's not a valid UTF-8 sequence:
CFStringRef str = CFStringCreateWithCString(NULL, path, kCFStringEncodingUTF8);
CFURLRef url = CFURLCreateWithFileSystemPath(NULL, str, kCFURLPOSIXPathStyle, false);
I suppose I could change all my JNI methods to take a byte[] rather than a String and do the UTF-8 conversion in Java, but that seems a bit ugly. Is there a better solution?

This is clearly explained in the Java documentation:
JNI Functions
GetStringUTFChars
const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().
Modified UTF-8
The JNI uses modified UTF-8 strings to represent various string types. Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.
All characters in the range \u0001 to \u007F are represented by a single byte, as follows:
The seven bits of data in the byte give the value of the character represented.
The null character ('\u0000') and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes x and y:
The bytes represent the character with the value ((x & 0x1f) << 6) + (y & 0x3f).
Characters in the range '\u0800' to '\uFFFF' are represented by 3 bytes x, y, and z:
The character with the value ((x & 0xf) << 12) + ((y & 0x3f) << 6) + (z & 0x3f) is represented by the bytes.
Characters with code points above U+FFFF (so-called supplementary characters) are represented by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes. This means, supplementary characters are represented by six bytes, u, v, w, x, y, and z:
The character with the value 0x10000 + ((v & 0x0f) << 16) + ((w & 0x3f) << 10) + ((y & 0x0f) << 6) + (z & 0x3f) is represented by the six bytes.
The bytes of multibyte characters are stored in the class file in big-endian (high byte first) order.
There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.
For more information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0.
Since U+1F604 is a supplementary character, and modified UTF-8 does not use standard UTF-8's 4-byte format, it is represented by encoding each half of the UTF-16 surrogate pair 0xD83D 0xDE04 as a separate 3-byte sequence, 6 bytes in total.
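For illustration only (this is not the JNI implementation, just a plain Java sketch of the same rule): applying the standard 3-byte UTF-8 pattern to each surrogate code unit independently reproduces exactly the six bytes that GetStringUTFChars() printed above.
char[] surrogates = Character.toChars(0x1F604); // 0xD83D, 0xDE04
for (char c : surrogates) {
    // 3-byte UTF-8 pattern applied to a single 16-bit code unit
    int b1 = 0xE0 | ((c >> 12) & 0x0F);
    int b2 = 0x80 | ((c >> 6) & 0x3F);
    int b3 = 0x80 | (c & 0x3F);
    System.out.printf("%02x %02x %02x%n", b1, b2, b3);
}
// Prints: ed a0 bd
//         ed b8 84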
So, to answer your question...
Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?
You can either:
Use GetStringChars() to get the original UTF-16 code units, and then create your own UTF-8 byte array from that. The conversion from UTF-16 to UTF-8 is a very simple algorithm to implement by hand, or you can use any pre-existing implementation provided by your platform or 3rd-party libraries.
Have your JNI code call back into Java to invoke the String.getBytes(String charsetName) method to encode the jstring object to a UTF-8 byte array, e.g.:
JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
{
    const jclass stringClass = env->GetObjectClass(_s);
    const jmethodID getBytes = env->GetMethodID(stringClass, "getBytes", "(Ljava/lang/String;)[B");
    const jstring charsetName = env->NewStringUTF("UTF-8");
    const jbyteArray stringJbytes = (jbyteArray) env->CallObjectMethod(_s, getBytes, charsetName);
    env->DeleteLocalRef(charsetName);

    const jsize length = env->GetArrayLength(stringJbytes);
    const jbyte* pBytes = env->GetByteArrayElements(stringJbytes, NULL);
    for (int i = 0; i < length; ++i)
        fprintf(stderr, "%d: %02x\n", i, (unsigned char) pBytes[i]);

    env->ReleaseByteArrayElements(stringJbytes, pBytes, JNI_ABORT);
    env->DeleteLocalRef(stringJbytes);
}
The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8
Java's Modified UTF-8 is not exactly the same as CESU-8:
CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).

Related

How to convert each char in string to 8 bit int? JAVA

It has been suggested that I use a TCP-like checksum, which consists of the sum of the (integer) sequence and ack field values, added to a character-by-character sum of the payload field of the packet (i.e., treat each character as if it were an 8-bit integer and just add them together).
I'm assuming it would go along the lines of:
char[] a = data.toCharArray();
for (int i = 0; i < a.length; i++) {
    ...
}
Though I'm pretty clueless as to how I could complete the actual conversion.
My data is a String, and I want to go through it (converted to a char array, though if there's a better way to do this, let me know!). Now that I'm ready to iterate, how does one convert each character to an int? I will then be summing the total.
As a String contains Unicode text and a char is a two-byte UTF-16 code unit, it might be better to first convert the String to bytes:
byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
data = new String(bytes, StandardCharsets.UTF_8); // Inverse.
int crc = 0;
for (byte b : bytes) {
    int n = b & 0xFF; // An int 0 .. 255 without sign extension
    crc ^= n;
}
Now you can handle any Unicode content of a String. UTF-8 is compact when the text contains enough ASCII characters, as in Chinese HTML pages. (For Chinese plain text, UTF-16 might be better.)

char[] into ascii and decimal value (double) java

Based on this array:
final char[] charValue = { 'u', ' ', '}','+' };
I want to print the double value and the ASCII value from it in Java.
I can't find a proper solution for that on the internet. I just found how to convert a single character into an integer value. But what about many characters?
The main problem is that I have a large char[] with some double and int values stored in it. The double values are stored in 4 bytes and the integers in 1 or 2 bytes, so I have to read all of this and convert it into double or integer values.
Thanks for your help.
When Java was designed, C's char type was being used for both binary bytes and text.
Java made a clear separation between binary data (byte[], InputStream/OutputStream) and Unicode text (char, String, Reader/Writer). Hence Java has full Unicode support. Binary data, byte[], needs one extra piece of information, the encoding used, in order to be convertible to text, char[]/String.
In Java a char[] will rarely be used the way it is in C/C++; it seems byte[] is what is intended here, as you mention 4 elements being used for an int, etcetera. A char is 16 bits, containing a UTF-16 code unit.
For this case one can use a ByteBuffer, either wrapping a byte[] or obtained from a memory-mapped file.
Writing
ByteBuffer buf = ByteBuffer.allocate(13); // 13 bytes
buf.order(ByteOrder.LITTLE_ENDIAN); // Intel order
buf.putInt(42); // at 0
buf.putDouble(Math.PI); // at 4
buf.put((byte) '0'); // at 12
buf.putDouble(4, 3.0); // at 4 overwrite PI
byte[] bytes = buf.array();
Reading
ByteBuffer buf = ByteBuffer.wrap(bytes);
buf.order(ByteOrder.LITTLE_ENDIAN); // Intel order
int a = buf.getInt();
double b = buf.getDouble();
byte c = buf.get();
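Assuming the Reading snippet runs on the bytes produced by the Writing snippet, it yields a == 42, b == 3.0 (the PI written at offset 4 was overwritten), and c == (byte) '0'.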

Single UTF-8 char to byte

If I am converting a UTF-8 char to byte, will there ever be a difference in the result of these 3 implementations based on locale, environment, etc.?
byte a = "1".getBytes()[0];
byte b = "1".getBytes(Charset.forName("UTF-8"))[0];
byte c = '1';
Your first line is dependent on the environment, because it will encode the string using the default character encoding of your system, which may or may not be UTF-8.
Your second line will always produce the same result, no matter what the locale or the default character encoding of your system is. It will always use UTF-8 to encode the string.
Note that UTF-8 is a variable-length character encoding. Only the first 128 characters (U+0000 through U+007F) are encoded in one byte; all other characters take up between 2 and 4 bytes.
Your third line assigns a char literal to a byte, which narrows the character's UTF-16 code unit value, since a Java char stores a UTF-16 code unit. For characters in the ASCII range, UTF-8 encodes the character as that same single byte value, so the result will be the same as the second line, but this is not true in general for arbitrary characters.
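For a character outside the ASCII range the three approaches no longer agree. A small illustration, using '\u00e9' ('é') purely as an example:
byte a = "\u00e9".getBytes()[0];                               // depends on the platform's default charset
byte b = "\u00e9".getBytes(Charset.forName("UTF-8"))[0];       // 0xc3, first byte of the pair C3 A9
byte c = (byte) '\u00e9';                                      // 0xe9, the UTF-16 code unit narrowed to a byte
System.out.printf("b = %02x, c = %02x%n", b & 0xff, c & 0xff); // prints: b = c3, c = e9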
In principle the question is already answered, but I cannot resist posting a little scribble, for those who like to play around with code:
import java.nio.charset.Charset;

public class EncodingTest {
    private static void checkCharacterConversion(String c) {
        byte asUtf8 = c.getBytes(Charset.forName("UTF-8"))[0];
        byte asDefaultEncoding = c.getBytes()[0];
        byte directConversion = (byte) c.charAt(0);
        if (asUtf8 != asDefaultEncoding) {
            System.out.println(String.format(
                    "First char of %s has different result in UTF-8 %d and default encoding %d",
                    c, asUtf8, asDefaultEncoding));
        }
        if (asUtf8 != directConversion) {
            System.out.println(String.format(
                    "First char of %s has different result in UTF-8 %d and direct as byte %d",
                    c, asUtf8, directConversion));
        }
    }

    public static void main(String[] argv) {
        // btw: first time I ever wrote a for loop with a char - feels weird to me
        for (char c = '\0'; c <= '\u007f'; c++) {
            String cc = new String(new char[] { c });
            checkCharacterConversion(cc);
        }
    }
}
If you run this e.g. with:
java -Dfile.encoding="UTF-16LE" EncodingTest
you will get no output.
But of course every single byte (ok, except for the first) will be wrong if you try:
java -Dfile.encoding="UTF-16BE" EncodingTest
because in "big endian" the first byte is always zero for ascii chars.
That is because in UTF-16 an ascii character '\u00xy is represented by two bytes, in UTF16-LE as [xy, 0] and in UTF16-BE as [0, xy]
However only the first statement produces any output, so b and c are indeed the same for the first 127 ascii characters - because in UTF-8 they are encoded by a single byte. This will not be true for any further characters, however; they all have multi-byte representations in UTF-8.

Type cast issue in java (ascii value to char)

The problem I am facing occurs when I try to type cast some ASCII values to char.
For example:
(char)145 //returns ?
(char)129 //also returns ?
but it is supposed to return a different character. It happens to many other values as well.
I hope I have been clear enough.
ASCII is a 7-bit encoding system. Some programs even use this to detect whether a file is binary or textual. Characters below 32 are control characters and are used as directives (for instance, newline).
The program will still work, however. A char is simply stored as a 16-bit value, much like a short, but the values in question don't map to printable characters. This means that the textual output of both values shows nothing useful. On the other hand, comparisons like (char) 145 == (char) 129 will still work (and return false), simply because to the processor there is no difference between a short and a character.
If you are interested in converting your value such that only the lowest seven bits count (thus modifying the value so that it is in the valid ASCII range), you can use masking:
int value = 145;
value &= 0x7f;
char c = (char) value;
The char type is 16-bit Unicode, UTF-16. So you could do (char) 265 for c-with-circumflex (U+0109). ASCII is 7 bits, 0 - 127.
String s = "" + ((char)145) + ((char)129);
The above is a string of two Unicode characters (each 2 bytes, UTF-16).
byte[] bytes = s.getBytes(StandardCharsets.US_ASCII); // ASCII; the non-ASCII chars become '?'
String t = new String(bytes, StandardCharsets.US_ASCII); // "??"
bytes = s.getBytes(StandardCharsets.ISO_8859_1); // ISO-8859-1, Latin-1
bytes = s.getBytes("Windows-1252"); // Windows Latin-1
bytes = s.getBytes(StandardCharsets.UTF_8); // No information loss.
t = new String(bytes, StandardCharsets.UTF_8); // Original string again.
In Java, String/char/Reader/Writer handle text (in Unicode), whereas byte[]/InputStream/OutputStream handle binary data, bytes.
Bytes must always be associated with an encoding in order to yield text.
Answer: as soon as there is a conversion from text to some encoding that cannot represent that char, a question mark may be written instead.
These expressions evaluate to true:
((char) 145) == '\u0091';
((char) 129) == '\u0081';
These UTF-16 values map to the Unicode code points U+0091 and U+0081:
0091;<control>;Cc;0;BN;;;;;N;PRIVATE USE ONE;;;;
0081;<control>;Cc;0;BN;;;;;N;;;;;
These are both control characters without visible graphemes (the question mark acts as a substitution character), and one of them (U+0091, PRIVATE USE ONE) has no designated purpose. Neither is in the ASCII set.
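A quick way to confirm this in code (a small sketch; the 0x3f byte is the '?' substitution character the ASCII encoder falls back to):
char c = (char) 145;                                           // U+0091, a C1 control character
System.out.println(Character.getType(c) == Character.CONTROL); // true: general category Cc
byte[] ascii = String.valueOf(c).getBytes(StandardCharsets.US_ASCII);
System.out.printf("%02x%n", ascii[0] & 0xff);                  // 3f, i.e. '?'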

Convert ICU4C byte to java char

I am accessing an ICU4C function through JNI which returns a UChar* (i.e. a Unicode character array). I was able to convert that to a jbyteArray by assigning each member of the UChar array to a local jbyte[] array that I created, and then I returned it to Java using the env->SetByteArrayRegion() function. Now I have the byte[] array in Java, but it's pretty much all gibberish - weird symbols at best. I am not sure where the problem might be. I am working with Unicode characters, if that matters. How do I convert the byte[] to a char[] in Java properly? Something is not being mapped right. Here is a snippet of the code:
--- JNI code (altered slightly to make it shorter) ---
static jint testFunction(JNIEnv* env, jclass c, jcharArray srcArray, jbyteArray destArray) {
    jchar* src = env->GetCharArrayElements(srcArray, NULL);
    int n = env->GetArrayLength(srcArray);
    UChar *testStr = new UChar[n];
    jbyte destChr[n];

    // calling ICU4C function here
    icu_function(src, testStr); // takes source characters and returns UChar*
    for (int i = 0; i < n; i++)
        destChr[i] = testStr[i]; // is this correct?
    delete[] testStr;

    env->SetByteArrayRegion(destArray, 0, n, destChr);
    env->ReleaseCharArrayElements(srcArray, src, JNI_ABORT);
    return (n); // anything for now
}
-- Java code --
String wohoo = "ABCD bal bla bla";
char[] myChars = wohoo.toCharArray();
byte[] myICUBytes = new byte[myChars.length];
int value = MyClass.testFunction (myChars, myICUBytes);
System.out.println(new String(myICUBytes)) ;// produces gibberish & weird symbols
I also tried System.out.println(new String(myICUBytes, Charset.forName("UTF-16"))) and it's just as gibberish.
Note that the ICU function does return the proper Unicode characters in the UChar*; somewhere between the conversion to jbyteArray and Java things are getting messed up.
Help!
destChr[i] = testStr[i]; //is this correct?
This looks like an issue all right.
JNI types:
byte jbyte signed 8 bits
char jchar unsigned 16 bits
ICU4C types:
Define UChar to be wchar_t if that is 16 bits wide; always assumed to be unsigned.
If wchar_t is not 16 bits wide, then define UChar to be uint16_t or char16_t because GCC >=4.4 can handle UTF16 string literals. This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.
So, aside from anything icu_function might be doing, you are trying to fit a 16-bit value into an 8-bit-wide type.
If you must use a Java byte array, I suggest converting to the 8-bit char type by transcoding to a Unicode encoding.
To paraphrase some C code:
UErrorCode status = U_ZERO_ERROR;
UChar *utf16 = (UChar*) malloc(len16 * sizeof(UChar));
//TODO: fill data
// convert to UTF-8
UConverter *encoding = ucnv_open("UTF-8", &status);
// preflight to get the required length (this sets U_BUFFER_OVERFLOW_ERROR)
int len8 = ucnv_fromUChars(encoding, NULL, 0, utf16, len16, &status);
status = U_ZERO_ERROR; // reset before the real conversion
char *utf8 = (char*) malloc(len8 * sizeof(char));
ucnv_fromUChars(encoding, utf8, len8, utf16, len16, &status);
ucnv_close(encoding);
//TODO: char to jbyte
You can then transcode this to a Java String using new String(myICUBytes, "UTF-8").
I used UTF-8 because it was already in my sample code and you don't have to worry about endianness. Convert my C to C++ as appropriate.
Have you considered using ICU4J?
Also, when converting your bytes to a string, you will need to specify a character encoding. I'm not familiar with the library in question, so I can't advise you further, but perhaps this will be "UTF-16" or similar?
Oh, and it's also worth noting that you might simply be getting display errors because the terminal you're printing to isn't using the correct character set and/or doesn't have the right glyphs available.
