I am using JNI to call the native C++ layer.
Java layer:
int res= recog(audioFilePath, grammarFilePath, contextID, subContextID);
C++ layer:
JNIEXPORT void JNICALL Java_com_uniphore_voice_recogniser_NuanceOfflineRecogniser_recog(JNIEnv *jenv, jobject jobj, jstring jaudioFilePath,
jstring jgrammarFilePath,
jstring jcontextID,
jstring jsubContext)
{
const char* _audioFilePath = (char*)jenv->GetStringChars(jaudioFilePath, JNI_FALSE);
const char* _grammarFilePath = (char*)jenv->GetStringChars(jgrammarFilePath, JNI_FALSE);
const char* _contextId = (char*)jenv->GetStringChars(jcontextID, JNI_FALSE);
const char* _subContextId = (char*)jenv->GetStringChars(jsubContext, JNI_FALSE);
std::wcout << "audio file path: " << _audioFilePath <<" "<< std::strlen(_audioFilePath) <<std::endl
<< "grammar file path: "<< _grammarFilePath <<" "<<std::strlen(_grammarFilePath) << std::endl
<< "contextId: " << _contextId << std::endl
<< "subContextId: " << _subContextId << std::endl << std::endl;
At the Java layer I can see that the values are passed correctly down to the native side, but when I print them in the C++ layer only the first character of each string is printed.
For example, if I pass "c:\test.wav" as audioFilePath, the C++ layer prints only "c".
I am building with Visual Studio 2013, with the project character set configured as Unicode.
I am new to the C++ environment; please help me understand the reason for this.
According to the JNI docs, GetStringChars returns the Unicode characters of the given string as a jchar *, which is an unsigned short *. You cast it to a char *. When you pass a char * to cout, it expects a null-terminated string of single-byte characters. You are giving it a pointer to UTF-16 data, in which every other byte is 0 for plain ASCII characters. That is why only the first character of the string is printed.
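For example, "c:\test.wav" stored as UTF-16 (little-endian, as on a typical Windows JVM) starts with the bytes 63 00 3A 00 5C 00 ..., so a routine expecting a char * string sees 'c' immediately followed by a 0 byte and treats that 0 as the terminator.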
GetStringChars does not return a pointer to single-byte characters, but to two-byte Unicode (UTF-16) characters:
const jchar * GetStringChars(JNIEnv *env, jstring string,
jboolean *isCopy);
Returns a pointer to the array of Unicode characters of the string.
Instead, try GetStringUTFChars. The string it returns is null-terminated as well.
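For reference, a minimal sketch of the function from the question rewritten with GetStringUTFChars (and releasing the strings when done) might look like this:
#include <cstring>
#include <iostream>
#include <jni.h>

JNIEXPORT void JNICALL Java_com_uniphore_voice_recogniser_NuanceOfflineRecogniser_recog(JNIEnv *jenv, jobject jobj,
    jstring jaudioFilePath, jstring jgrammarFilePath, jstring jcontextID, jstring jsubContext)
{
    // GetStringUTFChars returns a NUL-terminated (modified) UTF-8 string.
    const char* audioFilePath = jenv->GetStringUTFChars(jaudioFilePath, NULL);
    const char* grammarFilePath = jenv->GetStringUTFChars(jgrammarFilePath, NULL);
    const char* contextId = jenv->GetStringUTFChars(jcontextID, NULL);
    const char* subContextId = jenv->GetStringUTFChars(jsubContext, NULL);

    std::cout << "audio file path: " << audioFilePath << " " << std::strlen(audioFilePath) << std::endl
              << "grammar file path: " << grammarFilePath << " " << std::strlen(grammarFilePath) << std::endl
              << "contextId: " << contextId << std::endl
              << "subContextId: " << subContextId << std::endl;

    // Release the UTF-8 copies once they are no longer needed.
    jenv->ReleaseStringUTFChars(jsubContext, subContextId);
    jenv->ReleaseStringUTFChars(jcontextID, contextId);
    jenv->ReleaseStringUTFChars(jgrammarFilePath, grammarFilePath);
    jenv->ReleaseStringUTFChars(jaudioFilePath, audioFilePath);
}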
Related
I am writing a JNI layer. My Java program takes an image byte array using ByteOutputStream(), and this array is then used to call a C function that converts the byte array to an unsigned char*. Here is the code:
JNIEXPORT void JNICALL Java_ImageConversion_covertBytes(JNIEnv *env, jobject obj, jbyteArray array)
{
unsigned char* flag = (*env)->GetByteArrayElements(env, array, NULL);
jsize size = (*env)->GetArrayLength(env, array);
for(int i = 0; i < size; i++) {
printf("%c", flag[i]);
}
}
In this I keep getting a warning when I compile:
warning: initializing 'unsigned char *' with an expression of type 'jbyte *' (aka 'signed char *') converts between pointers to integer types with different sign [-Wpointer-sign]
unsigned char* flag = (*env)->GetByteArrayElements(env, array, NULL);
How can I remove this warning? I want to print all the characters.
The warning exists because the sign change might be important. In JNI the jbyte corresponds to Java byte which is a signed 8-bit integer; in C it is explicitly signed char.
However, it is OK to access any object with any character pointer, so you can cast to unsigned char explicitly:
unsigned char* flag = (unsigned char*)(*env)->GetByteArrayElements(env, array, NULL);
Alternatively, you can declare flag as signed char:
signed char* flag = (*env)->GetByteArrayElements(env, array, NULL);
This is fine for printf("%c\n", flag[i]); because %c requires that the argument be an integer; the integer is then converted to unsigned char so both signed and unsigned char will do.
A third option, however, would be to use neither: if you just want to write the bytes to the terminal, use a void * pointer and fwrite:
JNIEXPORT void JNICALL
Java_ImageConversion_covertBytes(JNIEnv *env, jobject obj, jbyteArray array)
{
void *flag = (*env)->GetByteArrayElements(env, array, NULL);
jsize size = (*env)->GetArrayLength(env, array);
fwrite(flag, 1, size, stdout);
/* Release the elements; JNI_ABORT because nothing was modified. */
(*env)->ReleaseByteArrayElements(env, array, flag, JNI_ABORT);
}
and let fwrite worry about the looping.
Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?
Unfortunately GetStringUTFChars() almost does what's required, but not quite: it returns a "modified" UTF-8 byte sequence. The main difference is that modified UTF-8 doesn't contain any embedded null bytes (so you can treat it as an ANSI C null-terminated string), but another difference is how Unicode supplementary characters such as emoji are treated.
A character such as U+1F604 "SMILING FACE WITH OPEN MOUTH AND SMILING EYES" is stored as a surrogate pair (two UTF-16 characters U+D83D U+DE04) and has a 4-byte UTF-8 equivalent of F0 9F 98 84, and that is the byte sequence that I get if I convert the string to UTF-8 in Java:
char[] c = Character.toChars(0x1F604);
String s = new String(c);
System.out.println(s);
for (int i=0; i<c.length; ++i)
System.out.println("c["+i+"] = 0x"+Integer.toHexString(c[i]));
byte[] b = s.getBytes("UTF-8");
for (int i=0; i<b.length; ++i)
System.out.println("b["+i+"] = 0x"+Integer.toHexString(b[i] & 0xFF));
The code above prints the following:
😄
c[0] = 0xd83d
c[1] = 0xde04
b[0] = 0xf0
b[1] = 0x9f
b[2] = 0x98
b[3] = 0x84
However, if I pass 's' into a native JNI method and call GetStringUTFChars() I get 6 bytes. Each of the surrogate pair characters is being converted to a 3-byte sequence independently:
JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
{
const char* sBytes = env->GetStringUTFChars(_s, NULL);
for (int i = 0; sBytes[i] != 0; ++i)
fprintf(stderr, "%d: %02x\n", i, (unsigned char)sBytes[i]);
env->ReleaseStringUTFChars(_s, sBytes);
}
0: ed
1: a0
2: bd
3: ed
4: b8
5: 84
The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8. That in turn causes my native Mac code to crash because it's not a valid UTF-8 sequence:
CFStringRef str = CFStringCreateWithCString(NULL, path, kCFStringEncodingUTF8);
CFURLRef url = CFURLCreateWithFileSystemPath(NULL, str, kCFURLPOSIXPathStyle, false);
I suppose I could change all my JNI methods to take a byte[] rather than a String and do the UTF-8 conversion in Java but that seems a bit ugly, is there a better solution?
This is clearly explained in the Java documentation:
JNI Functions
GetStringUTFChars
const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().
Modified UTF-8
The JNI uses modified UTF-8 strings to represent various string types. Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.
All characters in the range \u0001 to \u007F are represented by a single byte, as follows:
The seven bits of data in the byte give the value of the character represented.
The null character ('\u0000') and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes x and y:
The bytes represent the character with the value ((x & 0x1f) << 6) + (y & 0x3f).
Characters in the range '\u0800' to '\uFFFF' are represented by 3 bytes x, y, and z:
The character with the value ((x & 0xf) << 12) + ((y & 0x3f) << 6) + (z & 0x3f) is represented by the bytes.
Characters with code points above U+FFFF (so-called supplementary characters) are represented by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes. This means, supplementary characters are represented by six bytes, u, v, w, x, y, and z:
The character with the value 0x10000+((v&0x0f)<<16)+((w&0x3f)<<10)+((y&0x0f)<<6)+(z&0x3f) is represented by the six bytes.
The bytes of multibyte characters are stored in the class file in big-endian (high byte first) order.
There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.
For more information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0.
Since U+1F604 is a supplementary character, and Java does not support UTF-8's 4-byte encoding format, U+1F604 is represented in modified UTF-8 by encoding the UTF-16 surrogate pair U+D83D U+DE04 using 3 bytes per surrogate, thus 6 bytes total.
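To tie that to the bytes printed above: U+D83D is 1101100000111101 in binary; split into 4 + 6 + 6 bits and placed into the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx, it becomes ED A0 BD, and U+DE04 becomes ED B8 84 the same way, which is exactly the 6-byte sequence the native code printed.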
So, to answer your question...
Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?
You can either:
Use GetStringChars() to get the original UTF-16 encoded characters, and then create your own UTF-8 byte array from that. The conversion from UTF-16 to UTF-8 is a very simple algorithm to implement by hand (a sketch follows after the code below), or you can use any pre-existing implementation provided by your platform or 3rd-party libraries.
Have your JNI code call back into Java to invoke the String.getBytes(String charsetName) method to encode the jstring object as a UTF-8 byte array, e.g.:
JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
{
const jclass stringClass = env->GetObjectClass(_s);
const jmethodID getBytes = env->GetMethodID(stringClass, "getBytes", "(Ljava/lang/String;)[B");
const jstring charsetName = env->NewStringUTF("UTF-8");
const jbyteArray stringJbytes = (jbyteArray) env->CallObjectMethod(_s, getBytes, charsetName);
env->DeleteLocalRef(charsetName);
const jsize length = env->GetArrayLength(stringJbytes);
const jbyte* pBytes = env->GetByteArrayElements(stringJbytes, NULL);
for (int i = 0; i < length; ++i)
fprintf(stderr, "%d: %02x\n", i, pBytes[i]);
env->ReleaseByteArrayElements(stringJbytes, pBytes, JNI_ABORT);
env->DeleteLocalRef(stringJbytes);
}
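For the first option, a hand-rolled conversion might look something like the sketch below (the helper name toUtf8 is just for illustration; it assumes well-formed input, and a production version should validate surrogate pairs and handle errors):
#include <cstdint>
#include <string>
#include <jni.h>

// Convert a jstring's UTF-16 code units to standard UTF-8, using the 4-byte
// form for supplementary characters (unlike GetStringUTFChars).
static std::string toUtf8(JNIEnv* env, jstring s)
{
    const jchar* chars = env->GetStringChars(s, NULL);
    const jsize len = env->GetStringLength(s);
    std::string out;
    for (jsize i = 0; i < len; ++i) {
        uint32_t cp = chars[i];
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len &&
            chars[i + 1] >= 0xDC00 && chars[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (chars[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {                       // 1 byte
            out += (char)cp;
        } else if (cp < 0x800) {               // 2 bytes
            out += (char)(0xC0 | (cp >> 6));
            out += (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += (char)(0xE0 | (cp >> 12));
            out += (char)(0x80 | ((cp >> 6) & 0x3F));
            out += (char)(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += (char)(0xF0 | (cp >> 18));
            out += (char)(0x80 | ((cp >> 12) & 0x3F));
            out += (char)(0x80 | ((cp >> 6) & 0x3F));
            out += (char)(0x80 | (cp & 0x3F));
        }
    }
    env->ReleaseStringChars(s, chars);
    return out;
}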
The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8
Java's Modified UTF-8 is not exactly the same as CESU-8:
CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
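For example, a Java string containing just '\u0000' comes back from GetStringUTFChars() as the two bytes C0 80 followed by the terminating 0 byte, rather than as an embedded 00; that is what keeps modified UTF-8 strings safe to treat as ordinary C strings.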
I'd like to make a simple function that returns two Strings appended together, basically:
Java:
public native String getAppendedString(String name);
C:
jstring Java_com_example_hellojni_HelloJni_getAppendedString(JNIEnv* env, jobject thiz, jstring s) {
jstring sx = (*env)->GetStringUTFChars(env, s, NULL);
return ((*env)->NewStringUTF(env, "asd ")+sx);
}
It fails to build with:
jni/hello-jni.c:32: warning: initialization discards qualifiers from pointer target type
jni/hello-jni.c:34: error: invalid operands to binary + (have 'char *' and 'char *')
The return value should be "asd qwer"; how can I do this? I also tried:
jstring s1 = (*env)->NewStringUTF(env, "456");
jstring s2 = (*env)->NewStringUTF(env, "123");
jstring sall=strcat(s1, s2);
return sall;
Only returns "456"
There are a few issues here:
GetStringUTFChars returns a jbyte * (a null-terminated C string), not a jstring. You need this C string to do string manipulation in C.
You need to call ReleaseStringUTFChars when you're done with it.
You need to allocate enough memory to hold the concatenated string, using malloc.
As ethan mentioned, you need to concatenate your two C strings with strcat. (You cannot do this with the + operator: applied to a pointer, + merely offsets that pointer; it does not concatenate strings.)
Remember to free the memory you allocated once you're done with it (i.e., after it has been converted to a Java string).
You should do something along the lines of:
char *concatenated;
const jbyte *sx;
jstring retval;
/* Get the UTF-8 characters that represent our java string */
sx = (*env)->GetStringUTFChars(env, s, NULL);
/* Concatenate the two strings. */
concatenated = malloc(strlen("asd ") + strlen((const char *) sx) + 1);
strcpy(concatenated, "asd ");
strcat(concatenated, (const char *) sx);
/* Create java string from our concatenated C string */
retval = (*env)->NewStringUTF(env, concatenated);
/* Free the memory in sx */
(*env)->ReleaseStringUTFChars(env, s, sx);
/* Free the memory in concatenated */
free(concatenated);
return retval;
You can't concatenate two char* values with + in C or C++. Try using strcat instead.
http://www.cplusplus.com/reference/clibrary/cstring/strcat/
EDIT:
from the documentation for strcat:
char * strcat ( char * destination, const char * source );
Concatenate strings
Appends a copy of the source string to the destination string. The terminating null character in destination is overwritten by the first character of source, and a new null-character is appended at the end of the new string formed by the concatenation of both in destination.
This means that the first argument to strcat needs to have enough memory allocated to fit the entire concatenated string.
I have been trying to return an ARABIC string from a JNI call.
The java method is as follows
private native String ataTrans_CheckWord(String lpszWord, String lpszDest, int m_flag, int lpszReserved);
lpszWord : Input English
lpszDest : Ignore
m_flag : Ignore
lpszReserved : Ignore
Now when I use javah to generate the header file I get a C++ header file with this signature
JNIEXPORT jstring JNICALL Java_MyClass_ataTrans_1CheckWord (JNIEnv* env, jobject, jstring, jstring, jint , jint)
Now in this C++ code I have statements such as this
JNIEXPORT jstring JNICALL Java_MyClass_ataTrans_1CheckWord(JNIEnv* env, jobject, jstring jstrInput, jstring, jint , jint)
{
char aa[10];
char* bb;
char** cc;
bb = aa;
cc = &bb;
jstring tempValue;
const char* strCIn = (env)->GetStringUTFChars(jstrInput , &blnIsCopy);
int retVal = pDllataTrans_CheckWord(strCIn, cc, m_flag, lpszReserved);
printf("Orginal Arabic Conversion Index 0: %s \n",cc[0]); //This prints ARABIC properly
tempValue = (env)->NewString((jchar* )cc[0],10); // convert char array to jstring
printf("JSTRING UNICODE Created : %s \n",tempValue); //This prints junk
return tempValue;
}
I believe the Arabic content is inside the pointer-to-pointer “cc”. Finally, in my Java code I have a call like this:
String temp = myclassInstance.ataTrans_CheckWord("ABCDEFG", "",1, 0);
System.out.println("FROM JAVE OUTPUT : "+temp); //Prints Junk
I just can't manage to return Arabic characters to my Java code. Is there something I am doing wrong? I have tried various other alternatives such as
tempValue = env->NewStringUTF("شسيشسيشسيشس");
and returning tempValue, but no luck. It's always garbage on the Java side.
Java strings are internally UTF-16, an encoding which uses 2 or 4 bytes per character. Your translation system seems to return strings encoded in a MBCS (Multi-Byte Character Set) - 1-N bytes per character.
The JNI NewString function expects data encoded as UTF-16, and you're passing it something else, so in Java you get garbage data. The one piece of information that is missing is which encoding your translation system uses. I'll assume it's UTF-8 and use MultiByteToWideChar to convert to the format Java expects. The code below assumes you're doing this on Windows; if not, specify the platform and look at e.g. the iconv library.
int Len = strlen(cc[0])*2+2;
wchar_t * Buffer = (wchar_t *) malloc(Len);
MultiByteToWideChar(CP_UTF8, 0, cc[0], -1, Buffer, Len);
tempValue = (env)->NewString((jchar* )Buffer,wcslen(Buffer));
free(Buffer);
If you get strings as some other codepage, replace CP_UTF8 above.
As a side note, if the encoding actually is UTF-8, you can simply pass your cc[0] to NewStringUTF instead - This function handles the UTF-8 to UTF-16 conversion internally.
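If that is the case, the whole conversion collapses to something like this sketch (Arabic is BMP-only text, for which modified UTF-8 and standard UTF-8 coincide, so NewStringUTF is safe here as long as cc[0] is NUL-terminated):
// Sketch: hand the library's UTF-8 output straight to the JVM.
tempValue = env->NewStringUTF(cc[0]);
return tempValue;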
I am accessing an ICU4C function through JNI which returns a UChar * (i.e. a Unicode character array). I was able to convert that to a jbyteArray by assigning each member of the UChar array to a local jbyte[] array that I created, and then I returned it to Java using the env->SetByteArrayRegion() function. Now I have the byte[] array in Java, but it's pretty much all gibberish, weird symbols at best. I am not sure where the problem might be. I am working with Unicode characters, if that matters. How do I properly convert the byte[] to a char[] in Java? Something is not being mapped right. Here is a snippet of the code:
--- JNI code (altered slightly to make it shorter) ---
static jint testFunction(JNIEnv* env, jclass c, jcharArray srcArray, jbyteArray destArray) {
jchar* src = env->GetCharArrayElements(srcArray, NULL);
int n = env->GetArrayLength(srcArray);
UChar *testStr = new UChar[n];
jbyte destChr[n];
//calling ICU4C function here
icu_function (src, testStr); //takes source characters and returns UChar*
for (int i=0; i<n; i++)
destChr[i] = testStr[i]; //is this correct?
delete[] testStr;
env->SetByteArrayRegion(destArray, 0, n, destChr);
env->ReleaseCharArrayElements(srcArray, src, JNI_ABORT);
return (n); //anything for now
}
-- Java code --
String wohoo = "ABCD bal bla bla";
char[] myChars = wohoo.toCharArray();
byte[] myICUBytes = new byte[myChars.length];
int value = MyClass.testFunction (myChars, myICUBytes);
System.out.println(new String(myICUBytes)) ;// produces gibberish & weird symbols
I also tried System.out.println(new String(myICUBytes, Charset.forName("UTF-16"))) and it's just as much gibberish.
Note that the ICU function does return the proper Unicode characters in the UChar *; it is somewhere between the conversion to jbyteArray and Java that things get messed up.
Help!
destChr[i] = testStr[i]; //is this correct?
This looks like an issue all right.
JNI types:
byte jbyte signed 8 bits
char jchar unsigned 16 bits
ICU4C types:
Define UChar to be wchar_t if that is 16 bits wide; always assumed to be unsigned.
If wchar_t is not 16 bits wide, then define UChar to be uint16_t or char16_t because GCC >= 4.4 can handle UTF-16 string literals. This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.
So, aside from anything icu_function might be doing, you are trying to fit a 16-bit value into an 8-bit-wide type.
If you must use a Java byte array, I suggest converting to the 8-bit char type by transcoding to a Unicode encoding.
To paraphrase some C code:
UErrorCode status = U_ZERO_ERROR;
UChar *utf16 = (UChar*) malloc(len16 * sizeof(UChar));
//TODO: fill data
// convert to UTF-8
UConverter *encoding = ucnv_open("UTF-8", &status);
int len8 = ucnv_fromUChars(encoding, NULL, 0, utf16, len16, &status);
char *utf8 = (char*) malloc(len8 * sizeof(char));
ucnv_fromUChars(encoding, utf8, len8, utf16, len16, &status);
ucnv_close(encoding);
//TODO: char to jbyte
You can then transcode this to a Java String using new String(myICUBytes, "UTF-8").
I used UTF-8 because it was already in my sample code and you don't have to worry about endianness. Convert my C to C++ as appropriate.
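To fill in the final //TODO above, the UTF-8 bytes can be copied into the destArray parameter from the question's function with SetByteArrayRegion; a sketch (assuming the Java side sized destArray for the UTF-8 length, which may differ from the number of UChars):
// Copy the UTF-8 bytes into the Java byte[] supplied by the caller.
env->SetByteArrayRegion(destArray, 0, len8, reinterpret_cast<const jbyte*>(utf8));
free(utf8);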
Have you considered using ICU4J?
Also, when converting your bytes to a string, you will need to specify a character encoding. I'm not familiar with the library in question, so I can't advise you further, but perhaps this will be "UTF-16" or similar?
Oh, and it's also worth noting that you might simply be getting display errors because the terminal you're printing to isn't using the correct character set and/or doesn't have the right glyphs available.