Convert ICU4C byte to java char - java

I am calling an ICU4C function through JNI that returns a UChar * (i.e. a Unicode character array). I converted it to a jbyteArray by assigning each element of the UChar array to a local jbyte[] array, and then returned it to Java using env->SetByteArrayRegion(). Now I have the byte[] array in Java, but its contents are gibberish, weird symbols at best. I am not sure where the problem is. I am working with Unicode characters, if that matters. How do I convert the byte[] to a char[] in Java properly? Something is not being mapped right. Here is a snippet of the code:
--- JNI code (altered slightly to make it shorter) ---
static jint testFunction(JNIEnv* env, jclass c, jcharArray srcArray, jbyteArray destArray) {
jchar* src = env->GetCharArrayElements(srcArray, NULL);
int n = env->GetArrayLength(srcArray);
UChar *testStr = new UChar[n];
jbyte destChr[n];
//calling ICU4C function here
icu_function (src, testStr); //takes source characters and returns UChar*
for (int i=0; i<n; i++)
destChr[i] = testStr[i]; //is this correct?
delete[] testStr;
env->SetByteArrayRegion(destArray, 0, n, destChr);
env->ReleaseCharArrayElements(srcArray, src, JNI_ABORT);
return (n); //anything for now
}
-- Java code --
String wohoo = "ABCD bal bla bla";
char[] myChars = wohoo.toCharArray();
byte[] myICUBytes = new byte[myChars.length];
int value = MyClass.testFunction (myChars, myICUBytes);
System.out.println(new String(myICUBytes)) ;// produces gibberish & weird symbols
I also tried: System.out.println(new String(myICUBytes, Charset.forName("UTF-16"))) and the output is just as garbled.
Note that the ICU function does return the proper Unicode characters in the UChar *. Somewhere between the conversion to jbyteArray and Java, something is messing it up.
Help!

destChr[i] = testStr[i]; //is this correct?
This looks like an issue all right.
JNI types:
Java type   Native type   Description
byte        jbyte         signed 8 bits
char        jchar         unsigned 16 bits
ICU4C types:
Define UChar to be wchar_t if that is 16 bits wide; always assumed to be unsigned.
If wchar_t is not 16 bits wide, then define UChar to be uint16_t or char16_t because GCC >=4.4 can handle UTF16 string literals. This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.
So, aside from anything icu_function might be doing, you are trying to fit a 16-bit value into an 8-bit-wide type.
If you must use a Java byte array, I suggest transcoding the UTF-16 data to an 8-bit Unicode encoding such as UTF-8 and storing the result in 8-bit chars.
To paraphrase some C code:
UErrorCode status = U_ZERO_ERROR;
UChar *utf16 = (UChar*) malloc(len16 * sizeof(UChar));
//TODO: fill data
// convert to UTF-8
UConverter *encoding = ucnv_open("UTF-8", &status);
// preflight: returns the required length and sets U_BUFFER_OVERFLOW_ERROR
int len8 = ucnv_fromUChars(encoding, NULL, 0, utf16, len16, &status);
status = U_ZERO_ERROR; // reset, or the next call returns immediately
char *utf8 = (char*) malloc(len8 * sizeof(char));
ucnv_fromUChars(encoding, utf8, len8, utf16, len16, &status);
ucnv_close(encoding);
//TODO: char to jbyte
You can then transcode this to a Java String using new String(myICUBytes, "UTF-8").
I used UTF-8 because it was already in my sample code and you don't have to worry about endianness. Convert my C to C++ as appropriate.
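To see the difference on the Java side, here is a minimal sketch. The class and method names are made up for illustration, and String.getBytes stands in for the UTF-8 bytes that ucnv_fromUChars would hand back through JNI. It contrasts the question's element-by-element narrowing with a proper UTF-8 round trip:

```java
import java.nio.charset.StandardCharsets;

class NarrowingDemo {
    // Mimics the loop in the question: keeps only the low 8 bits of each 16-bit unit.
    static byte[] truncate(char[] utf16) {
        byte[] out = new byte[utf16.length];
        for (int i = 0; i < utf16.length; i++) {
            out[i] = (byte) utf16[i];
        }
        return out;
    }

    // What the Java caller does with bytes that were transcoded to UTF-8.
    static String decodeUtf8(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "ABC \u0634\u0633\u064A"; // ASCII plus three Arabic letters
        byte[] truncated = truncate(original.toCharArray());
        // Assumption: stands in for the output of ucnv_fromUChars on the native side.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(truncated).equals(original)); // false
        System.out.println(decodeUtf8(utf8).equals(original));      // true
    }
}
```

The truncated array has one byte per char, so the high byte of every non-ASCII character is lost for good; no choice of charset in new String(...) can bring it back.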

Have you considered using ICU4J?
Also, when converting your bytes to a string, you will need to specify a character encoding. I'm not familiar with the library in question, so I can't advise you further, but perhaps this will be "UTF-16" or similar?
Oh, and it's also worth noting that you might simply be getting display errors because the terminal you're printing to isn't using the correct character set and/or doesn't have the right glyphs available.

Related

How to convert each char in string to 8 bit int? JAVA

I've been suggested a TCP-like checksum, which consists of the sum of the (integer) sequence and ack field values, added to a character-by-character sum of the payload field of the packet (i.e., treat each character as if it were an 8 bit integer and just add them together).
I'm assuming it would go along the lines of:
char[] a = data.toCharArray();
for (int i = 0; i < a.length; i++) {
...
}
Though I'm pretty clueless as to how I could complete the actual conversion?
My data is a String, and I want to walk through it (converted to a char array, though let me know if there's a better way). How does one convert each character to an int so that I can then sum the total?
As String contains Unicode, and char is a two-byte UTF-16 implementation of Unicode, it might be better to first convert the String to bytes:
byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
data = new String(bytes, StandardCharsets.UTF_8); // Inverse.
int crc = 0;
for (byte b : bytes) {
int n = b & 0xFF; // An int 0 .. 255 without sign extension
crc ^= n;
}
Now you can handle any Unicode content of a String. UTF-8 is compact when the text contains enough ASCII, as in Chinese HTML pages where the markup is ASCII. (For Chinese plain text, UTF-16 might be more compact.)
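The loop above folds the bytes with XOR; the checksum described in the question adds them instead. A minimal sketch of that additive variant follows. The class name, method name, and the choice of UTF-8 are assumptions, and sender and receiver would have to agree on the same encoding:

```java
import java.nio.charset.StandardCharsets;

class PayloadChecksum {
    // Sum of seq, ack, and every payload byte taken as an unsigned 8-bit value.
    static int checksum(int seq, int ack, String payload) {
        int sum = seq + ack;
        for (byte b : payload.getBytes(StandardCharsets.UTF_8)) {
            sum += b & 0xFF; // mask so bytes >= 0x80 are not sign-extended
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(checksum(1, 2, "AB")); // 1 + 2 + 65 + 66 = 134
    }
}
```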

Getting true UTF-8 characters in Java JNI

Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?
Unfortunately GetStringUTFChars() almost does what's required but not quite: it returns a "modified" UTF-8 byte sequence. The main difference is that modified UTF-8 doesn't contain any null characters (so you can treat it as an ANSI C null-terminated string), but another difference seems to be how Unicode supplementary characters such as emoji are treated.
A character such as U+1F604 "SMILING FACE WITH OPEN MOUTH AND SMILING EYES" is stored as a surrogate pair (two UTF-16 characters U+D83D U+DE04) and has a 4-byte UTF-8 equivalent of F0 9F 98 84, and that is the byte sequence that I get if I convert the string to UTF-8 in Java:
char[] c = Character.toChars(0x1F604);
String s = new String(c);
System.out.println(s);
for (int i=0; i<c.length; ++i)
System.out.println("c["+i+"] = 0x"+Integer.toHexString(c[i]));
byte[] b = s.getBytes("UTF-8");
for (int i=0; i<b.length; ++i)
System.out.println("b["+i+"] = 0x"+Integer.toHexString(b[i] & 0xFF));
The code above prints the following:
😄
c[0] = 0xd83d
c[1] = 0xde04
b[0] = 0xf0
b[1] = 0x9f
b[2] = 0x98
b[3] = 0x84
However, if I pass 's' into a native JNI method and call GetStringUTFChars() I get 6 bytes. Each of the surrogate pair characters is being converted to a 3-byte sequence independently:
JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
{
const char* sBytes = env->GetStringUTFChars(_s, NULL);
for (int i=0; sBytes[i]!=0; ++i)
fprintf(stderr, "%d: %02x\n", i, (unsigned char) sBytes[i]); // cast avoids sign extension
env->ReleaseStringUTFChars(_s, sBytes);
}
0: ed
1: a0
2: bd
3: ed
4: b8
5: 84
The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8. That in turn causes my native Mac code to crash because it's not a valid UTF-8 sequence:
CFStringRef str = CFStringCreateWithCString(NULL, path, kCFStringEncodingUTF8);
CFURLRef url = CFURLCreateWithFileSystemPath(NULL, str, kCFURLPOSIXPathStyle, false);
I suppose I could change all my JNI methods to take a byte[] rather than a String and do the UTF-8 conversion in Java but that seems a bit ugly, is there a better solution?
This is clearly explained in the Java documentation:
JNI Functions
GetStringUTFChars
const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().
Modified UTF-8
The JNI uses modified UTF-8 strings to represent various string types. Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.
All characters in the range \u0001 to \u007F are represented by a single byte, as follows:
The seven bits of data in the byte give the value of the character represented.
The null character ('\u0000') and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes x and y:
The bytes represent the character with the value ((x & 0x1f) << 6) + (y & 0x3f).
Characters in the range '\u0800' to '\uFFFF' are represented by 3 bytes x, y, and z:
The character with the value ((x & 0xf) << 12) + ((y & 0x3f) << 6) + (z & 0x3f) is represented by the bytes.
Characters with code points above U+FFFF (so-called supplementary characters) are represented by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes. This means, supplementary characters are represented by six bytes, u, v, w, x, y, and z:
The character with the value 0x10000+((v&0x0f)<<16)+((w&0x3f)<<10)+((y&0x0f)<<6)+(z&0x3f) is represented by the six bytes.
The bytes of multibyte characters are stored in the class file in big-endian (high byte first) order.
There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.
For more information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0.
Since U+1F604 is a supplementary character, and Java does not support UTF-8's 4-byte encoding format, U+1F604 is represented in modified UTF-8 by encoding the UTF-16 surrogate pair U+D83D U+DE04 using 3 bytes per surrogate, thus 6 bytes total.
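The same difference can be reproduced in pure Java, without JNI. The sketch below relies on DataOutputStream.writeUTF, which is documented to write a 2-byte length followed by modified UTF-8; the class and helper names are made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Returns the modified UTF-8 bytes of s, without writeUTF's 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F604));
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // f0 9f 98 84
        System.out.println(hex(modifiedUtf8(s)));                    // ed a0 bd ed b8 84
        System.out.println(hex(modifiedUtf8("\u0000")));             // c0 80
    }
}
```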
So, to answer your question...
Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?
You can either:
Use GetStringChars() to get the original UTF-16 encoded characters, and then create your own UTF-8 byte array from that. The conversion from UTF-16 to UTF-8 is a very simple algorithm to implement by hand, or you can use any pre-existing implementation provided by your platform or 3rd party libraries.
Have your JNI code call back into Java to invoke the String.getBytes(String charsetName) method to encode the jstring object to a UTF-8 byte array, eg:
JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
{
const jclass stringClass = env->GetObjectClass(_s);
const jmethodID getBytes = env->GetMethodID(stringClass, "getBytes", "(Ljava/lang/String;)[B");
const jstring charsetName = env->NewStringUTF("UTF-8");
const jbyteArray stringJbytes = (jbyteArray) env->CallObjectMethod(_s, getBytes, charsetName);
env->DeleteLocalRef(charsetName);
const jsize length = env->GetArrayLength(stringJbytes);
jbyte* pBytes = env->GetByteArrayElements(stringJbytes, NULL); // non-const: Release takes jbyte*
for (int i = 0; i < length; ++i)
fprintf(stderr, "%d: %02x\n", i, (unsigned char) pBytes[i]); // cast avoids sign extension
env->ReleaseByteArrayElements(stringJbytes, pBytes, JNI_ABORT);
env->DeleteLocalRef(stringJbytes);
}
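For the first option, the hand-written conversion might look roughly like the sketch below. It is written in Java only so that it can be checked against String.getBytes; the same branches would carry over to a C or C++ loop over the jchar buffer returned by GetStringChars(). The class name is made up, and replacing unpaired surrogates with U+FFFD is a policy assumption, not something JNI prescribes:

```java
import java.io.ByteArrayOutputStream;

class Utf16ToUtf8 {
    // Encodes UTF-16 code units as standard UTF-8 (4 bytes for supplementary characters).
    static byte[] encode(char[] utf16) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < utf16.length; i++) {
            int cp = utf16[i];
            if (Character.isHighSurrogate(utf16[i]) && i + 1 < utf16.length
                    && Character.isLowSurrogate(utf16[i + 1])) {
                // Combine a valid surrogate pair into one code point and skip the low half.
                cp = Character.toCodePoint(utf16[i], utf16[++i]);
            } else if (Character.isSurrogate(utf16[i])) {
                cp = 0xFFFD; // unpaired surrogate: substitute U+FFFD (assumed policy)
            }
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] b = encode(Character.toChars(0x1F604));
        for (byte x : b) {
            System.out.printf("%02x%n", x & 0xFF); // f0, 9f, 98, 84 on separate lines
        }
    }
}
```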
The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8
Java's Modified UTF-8 is not exactly the same as CESU-8:
CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).

char[] into ascii and decimal value (double) java

Based on this array:
final char[] charValue = { 'u', ' ', '}','+' };
I want to print the double value and the ASCII value from it in Java.
I can't find a proper solution for that on the internet. I only found how to convert a single Character into an Integer value. But what about many characters?
The main problem is that I have a large char[] in which some double and int values are stored. Double values are stored in 4 bytes and integers in 1 or 2 bytes, so I have to read all of this and convert it into doubles or integers.
Thanks for your help.
When Java was designed, C was using char for both binary bytes and text.
Java made a clear separation between binary data (byte[], InputStream/OutputStream) and Unicode text (char, String, Reader/Writer). Hence Java has full Unicode support. Binary data in a byte[] needs one extra piece of information, the encoding it uses, before it can be converted to text (char[]/String).
In Java a char[] is rarely used the way it is in C/C++, and it seems byte[] is what you intend, as you mention 4 elements being used for an int, etc. A char is 16 bits and holds a UTF-16 code unit.
For this case one can use a ByteBuffer, either wrapping a byte[] or taken from a memory-mapped file.
Writing
ByteBuffer buf = ByteBuffer.allocate(13); // 13 bytes
buf.order(ByteOrder.LITTLE_ENDIAN); // Intel order
buf.putInt(42); // at 0
buf.putDouble(Math.PI); // at 4
buf.put((byte) '0'); // at 12
buf.putDouble(4, 3.0); // at 4 overwrite PI
byte[] bytes = buf.array();
Reading
ByteBuffer buf = ByteBuffer.wrap(bytes);
buf.order(ByteOrder.LITTLE_ENDIAN); // Intel order
int a = buf.getInt();
double b = buf.getDouble();
byte c = buf.get();
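The question mentions values stored in 4, 2, and 1 bytes. A sketch for that case follows, assuming a hypothetical record layout of one 4-byte float, one unsigned 2-byte integer, and one unsigned 1-byte integer in little-endian order. The layout, the byte order, and the class name are all assumptions to be replaced by the real format:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class PackedRecord {
    // Hypothetical layout: float (4 bytes), unsigned short (2 bytes), unsigned byte (1 byte).
    static double[] read(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        double value = buf.getFloat();         // 4-byte IEEE 754 float, widened to double
        int twoByte = buf.getShort() & 0xFFFF; // mask to read the short as unsigned
        int oneByte = buf.get() & 0xFF;        // mask to read the byte as unsigned
        return new double[] { value, twoByte, oneByte };
    }

    public static void main(String[] args) {
        byte[] bytes = ByteBuffer.allocate(7).order(ByteOrder.LITTLE_ENDIAN)
                .putFloat(1.5f).putShort((short) 40000).put((byte) 200).array();
        System.out.println(Arrays.toString(read(bytes))); // [1.5, 40000.0, 200.0]
    }
}
```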

Return Arabic from JNI call

I have been trying to return an ARABIC string from a JNI call.
The java method is as follows
private native String ataTrans_CheckWord(String lpszWord, String lpszDest, int m_flag, int lpszReserved);
lpszWord : Input English
lpszDest : Ignore
m_flag : Ignore
lpszReserved :Ignore
Now when I use javah to generate the header file I get a C++ header file with this signature
JNIEXPORT jstring JNICALL Java_MyClass_ataTrans_1CheckWord (JNIEnv* env, jobject, jstring, jstring, jint , jint)
Now in this C++ code I have statements such as this
JNIEXPORT jstring JNICALL Java_MyClass_ataTrans_1CheckWord(JNIEnv* env, jobject, jstring jstrInput, jstring, jint , jint)
{
char aa[10];
char* bb;
char** cc;
bb = aa;
cc = &bb;
jstring tempValue;
const char* strCIn = (env)->GetStringUTFChars(jstrInput , &blnIsCopy);
int retVal = pDllataTrans_CheckWord(strCIn, cc, m_flag, lpszReserved);
printf("Orginal Arabic Conversion Index 0: %s \n",cc[0]); //This prints ARABIC properly
tempValue = (env)->NewString((jchar* )cc[0],10); // convert char array to jstring
printf("JSTRING UNICODE Created : %s \n",tempValue); //This prints junk
return tempValue;
}
I believe the ARABIC content is inside the pointer to a pointer “cc”. Finally in my java code I have a call like this
String temp = myclassInstance.ataTrans_CheckWord("ABCDEFG", "",1, 0);
System.out.println("FROM JAVE OUTPUT : "+temp); //Prints Junk
I just can’t manage to return any ARABIC characters to my Java code. Is there something I am doing wrong? I have tried various alternatives, such as
tempValue = env->NewStringUTF("شسيشسيشسيشس");
and returning tempValue, but no luck. It's always garbage on the Java side.
Java strings are internally UTF-16, an encoding which uses 2 or 4 bytes per character. Your translation system seems to return strings encoded in a MBCS (Multi-Byte Character Set) - 1-N bytes per character.
The JNI NewString function expects data encoded as UTF-16, and you're passing it something else - so in java you get garbage data. The one thing that is lacking from your information is which encoding your translation system uses. I'll assume it's UTF-8, and use MultiByteToWideChar to convert to the format java expects. The below code assumes that you're doing this on Windows - if not, specify platform, and look at e.g. the iconv library.
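int Len = (int) strlen(cc[0]) + 1; // UTF-16 code units needed, including the terminator
wchar_t * Buffer = (wchar_t *) malloc(Len * sizeof(wchar_t));
MultiByteToWideChar(CP_UTF8, 0, cc[0], -1, Buffer, Len); // last argument is a count of wchar_t, not bytes
tempValue = (env)->NewString((jchar* )Buffer, (jsize) wcslen(Buffer));
free(Buffer);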
If you get strings as some other codepage, replace CP_UTF8 above.
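As a side note, if the encoding actually is UTF-8, you can simply pass your cc[0] to NewStringUTF instead. This function handles the conversion to UTF-16 internally. Strictly speaking it expects modified UTF-8, which is identical to standard UTF-8 for Arabic text and everything else in the Basic Multilingual Plane except the NUL character.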

Convert INT(0-255) to UTF8 char in Java

Since I need to control some devices, I need to send some bytes to them. I create those bytes by combining some int values (with the & operator) into a byte, and finally append it to a String to send over the radio function to the robot.
Unfortunately, Java has some major issues doing that (the unsigned int problem).
Does anybody know how I can convert an integer, e.g.
x = 223;
to an 8-bit character in Java to append it to a String?
char c = (char) x; // does not work !! 16 bit !! I need 8 bit !
A char is 16-bit in Java. Use a byte if you need an 8-bit datatype.
See How to convert Strings to and from UTF8 byte arrays in Java on how to convert a byte[] to String with UTF-8 encoding.
Sending a java.lang.String over the wire is probably the wrong approach here, since the chars in a String are always 16-bit (Java was designed with internationalization in mind). If your radio library allows you to pass a byte[] instead of a String, that will allow you to send 8-bit values without needing to worry about converting to UTF-8. As far as converting from an int to an unsigned byte, you'll probably want to look at this article.
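A minimal sketch of the int-to-byte conversion described above, with made-up class and helper names. The last part assumes the radio API accepts only a String and encodes it as ISO-8859-1; that is a guess about the library, not something the question confirms:

```java
import java.nio.charset.StandardCharsets;

class UnsignedByteDemo {
    // Narrows 0..255 to a byte; values above 127 come out negative but keep their bit pattern.
    static byte toByte(int value) {
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException("out of range: " + value);
        }
        return (byte) value;
    }

    // Reads a byte back as an unsigned value in 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = toByte(223);
        System.out.println(b);             // -33 (same bits as 0xDF)
        System.out.println(toUnsigned(b)); // 223

        // Assumption: a String-only API that encodes with ISO-8859-1.
        // That charset maps each char in 0..255 to exactly one byte with the same value.
        String s = String.valueOf((char) 223);
        byte[] wire = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(wire.length + " " + toUnsigned(wire[0])); // 1 223
    }
}
```

With UTF-8 instead of ISO-8859-1, the same one-char String would encode to two bytes (0xC3 0x9F), which is why the encoding used by the transport matters.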
int to array of bytes
public byte[] intToByteArray(int num){
byte[] intBytes = new byte[4];
intBytes[0] = (byte) (num >>> 24);
intBytes[1] = (byte) (num >>> 16);
intBytes[2] = (byte) (num >>> 8);
intBytes[3] = (byte) num;
return intBytes;
}
Note that the byte order here is big-endian.
