Is using String in Java to hold binary data wrong? - java

I need to pass binary data (read from a file) from Java to C++ (using JNI), so I have a C++ function that expects a string (because in C++ a string is just a char array).
I read my binary file in Java using the following code:
byte[] buffer = new byte[512];
FileInputStream in = new FileInputStream("some_file");
int rc = in.read(buffer);
while (rc != -1)
{
    // rc should contain the number of bytes read in this operation.
    // do stuff...
    // next read
    rc = in.read(buffer);
    String s = new String(buffer);
    // here I call my C++ function and pass "s"
}
I'm worried about the line that creates the String. What actually happens when I put the buffer inside a String? By the time the data arrives in my C++ code it is different from what I expect it to be.
Does the String constructor change the data somehow?

Strings are not char arrays at all. They are complex Unicode beasts with semantic interactions between the code points, different binary encodings, and so on. This is true in every language; the only thing different about C++ is that its community hasn't finished complaining and started doing something about it yet.
In all languages, for binary data, use an explicit binary data type, like array of bytes.

A C++ char is a Java byte. Both are 8-bit. A Java char is a 16-bit value.
Ignore that C++ calls it char. Give it a Java byte[].
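For example, a minimal sketch of the Java side (the native method name and the omitted library loading are hypothetical, not something the answer prescribes; on the C++ side the jbyteArray elements map directly onto 8-bit chars):
import java.io.FileInputStream;
import java.io.IOException;

public class BinaryBridge {
    // Hypothetical native method: the C++ implementation receives a jbyteArray
    // plus the count of valid bytes, with no charset conversion in between.
    private static native void process(byte[] data, int length);

    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[512];
        try (FileInputStream in = new FileInputStream("some_file")) {
            int rc;
            while ((rc = in.read(buffer)) != -1) {
                process(buffer, rc); // pass the raw bytes and how many of them are valid
            }
        }
    }
}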

Related

Implement fread (readInt) in java

I am attempting to convert a program that reads a binary file in C++ to Java. The file is little-endian.
fread(n, sizeof (unsigned), 1, inputFile);
The C++ snippet above reads one unsigned integer into the variable 'n'.
I am currently using this method to accomplish the same thing:
public static int readInt(RandomAccessFile inputStream) throws IOException {
    int retVal;
    byte[] buffer = new byte[4];
    inputStream.readFully(buffer);
    ByteBuffer wrapped = ByteBuffer.wrap(buffer);
    wrapped.order(ByteOrder.LITTLE_ENDIAN);
    retVal = wrapped.getInt();
    return retVal;
}
but this method sometimes gives a different result from the C++ example. I haven't been able to determine which parts of the file cause it to fail, but I know it does. For example, when reading one part of the file my readInt method returns 543974774 but the C++ version returns 1.
Is there a better way to read little-endian values in Java? Or is there some obvious flaw in my implementation? Any help understanding where I could be going wrong, or how I could read these values more effectively, would be very appreciated.
Update:
I am using RandomAccessFile because I frequently require fseek functionality, which RandomAccessFile provides in Java.
543974774 is, in hex, 206C6576.
There is no endianness on the planet that turns 206C6576 into '1'. The problem is therefore that you aren't reading what you think you're reading: if the C code is reading 4 bytes (or even a variable, unknown number of bytes) and turns that into '1', then your Java code wasn't reading the same bytes - your C code and Java code are out of sync. At some point, your C code read, for example, 2 bytes while your Java code read 4 bytes, or vice versa.
The problem isn't in your readInt method - that does the job properly every time.
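One low-tech way to find where the two programs diverge (a sketch, not part of the original answer) is to log the file offset before every read on the Java side and compare it with what ftell() prints at the same points in the C++ program. The readInt method is copied from the question so the sketch compiles on its own:
import java.io.IOException;
import java.io.RandomAccessFile;

public class ReadTrace {
    // Wraps readInt and prints the offset it read from, so the Java trace can
    // be compared line by line with an ftell() trace from the C++ program.
    public static int tracedReadInt(RandomAccessFile in) throws IOException {
        long offset = in.getFilePointer();   // position before the read
        int value = readInt(in);
        System.out.printf("offset=%d value=%d (0x%08X)%n", offset, value, value);
        return value;
    }

    // Copied from the question.
    public static int readInt(RandomAccessFile inputStream) throws IOException {
        byte[] buffer = new byte[4];
        inputStream.readFully(buffer);
        java.nio.ByteBuffer wrapped = java.nio.ByteBuffer.wrap(buffer);
        wrapped.order(java.nio.ByteOrder.LITTLE_ENDIAN);
        return wrapped.getInt();
    }
}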

How to convert "java.nio.HeapByteBuffer" to String

I have a data structure java.nio.HeapByteBuffer[pos=71098 lim=71102 cap=94870] which I need to convert to an Int (in Scala). The conversion might look simple, but whatever approach I try, I don't get the right result. Could you please help me?
Here is my code snippet:
val v : ByteBuffer= map.get("company").get
val utf_str = new String(v, java.nio.charset.StandardCharsets.UTF_8)
println (utf_str)
the output is just "R" ??
I can't see how you can even get that to compile; String has constructors that accept another String or possibly an array, but not a ByteBuffer or any of its parents.
To work with the nio buffer api you first write to a buffer, then do a flip before you read from the buffer, there are lots of good resources online about that. This one for example: http://tutorials.jenkov.com/java-nio/buffers.html
How to read that as a string entirely depends on how the characters are encoded inside the buffer, if they are two bytes per character (as strings are in Java/the JVM) you can convert your buffer to a character buffer by using asCharBuffer.
So, for example:
val byteBuffer = ByteBuffer.allocate(7).order(ByteOrder.BIG_ENDIAN);
byteBuffer.putChar('H').putChar('i').putChar('!')
byteBuffer.flip()
val charBuffer = byteBuffer.asCharBuffer
assert(charBuffer.toString == "Hi!")
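If the bytes in the buffer are UTF-8 text rather than two-byte chars, another option (shown here in Java as a sketch; it is not part of the answer above) is to decode the buffer with a Charset:
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DecodeUtf8Buffer {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap("Hi!".getBytes(StandardCharsets.UTF_8));
        // Charset.decode consumes the bytes between position and limit,
        // so the buffer must already be flipped/positioned for reading.
        String s = StandardCharsets.UTF_8.decode(buf).toString();
        System.out.println(s); // Hi!
    }
}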

How to convert byte array in String format to byte array?

I have created a byte array of a file.
FileInputStream fileInputStream=null;
File file = new File("/home/user/Desktop/myfile.pdf");
byte[] bFile = new byte[(int) file.length()];
try {
fileInputStream = new FileInputStream(file);
fileInputStream.read(bFile);
fileInputStream.close();
}catch(Exception e){
e.printStackTrace();
}
Now I have an API which expects JSON input, and there I have to put the above byte array in String form. After reading the byte array in String form, I need to convert it back to a byte array again.
So, help me figure out:
1) How to convert a byte array to a String and then back to the same byte array?
The general problem of byte[] <-> String conversion is easily solved once you know the actual character set (encoding) that has been used to "serialize" a given text to a byte stream, or which is needed by the peer component to accept a given byte stream as text input - see the perfectly valid answers already given on this. I've seen a lot of problems due to a lack of understanding of character sets (and text encoding in general) in enterprise Java projects, even with experienced software developers, so I really suggest diving into this quite interesting topic.
It is generally key to keep the character encoding information as some sort of "meta" information with your binary data if it represents text in some way. Hence the header in, for example, XML files, or even suffixes as parts of file names, as is sometimes seen with Apache htdocs contents, not to mention filesystem-specific ways to add metadata to files. Also, when communicating via, say, HTTP, the Content-Type header fields often contain additional charset information to allow for correct interpretation of the actual contents.
However, since in your example you read a PDF file, I'm not sure if you can actually expect pure text data anyway, regardless of any character encoding.
So in this case - depending on the rest of the application you're working on - you may want to transfer binary data within a JSON string. A common way to do so is to convert the binary data to Base64 and, once transferred, recover the binary data from the received Base64 string.
How do I convert a byte array to Base64 in Java?
is a good starting point for such a task.
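With Java 8 or later, a minimal sketch of that round trip using the built-in java.util.Base64 (the sample bytes are arbitrary placeholders):
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] original = {0x25, 0x50, 0x44, 0x46, (byte) 0x9A, 0x00, 0x7F}; // arbitrary binary data
        // Encode to a plain ASCII string that is safe to embed in a JSON value.
        String encoded = Base64.getEncoder().encodeToString(original);
        // Decode on the receiving side; the result is byte-for-byte identical.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(encoded);
        System.out.println(Arrays.equals(original, decoded)); // true
    }
}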
String class provides an overloaded constructor for this.
String s = new String(byteArray, "UTF-8");
byteArray = s.getBytes("UTF-8");
Providing an explicit charset is encouraged because different encoding schemes may have different byte representations.
Also, your InputStream may not read all the contents in one go. You have to read in a loop until there is nothing more left to be read. Read the documentation; read() returns the number of bytes read:
Reads up to b.length bytes of data from this input stream into an array of bytes. This method blocks until some input is available.
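A minimal sketch of such a read loop (the helper class and method names are chosen here for illustration):
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadAllBytes {
    public static byte[] readFully(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            // A single read() may return fewer bytes than requested, so keep
            // going until it signals end of stream with -1.
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        }
    }
}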
String.getBytes() and String(byte[] bytes) are methods to consider.
Convert byte array to String
String s = new String(bFile , "ISO-8859-1" );
Convert String to byte array
byte[] bArray = s.getBytes("ISO-8859-1");
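A quick way to convince yourself that the ISO-8859-1 round trip above is lossless for arbitrary binary data (a sketch, not part of the original answer):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;
        // ISO-8859-1 maps every byte value 0..255 to the code point with the
        // same number, so String -> byte[] -> String loses nothing.
        String s = new String(all, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(all, back)); // true
    }
}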

How to convert Java String to C++ String using bytes as the medium

What would be the algorithm/implementation of the C++ code C++functionX in the following flow chart:
(JavaString) --getBytes--> (bytes) --C++functionX--> (C++String)
JavaString contents should match C++String contents as far as possible (preferably 100% for all possible values of JavaString)
[EDIT] The endianness of bytes can be ignored as there are ways to handle that.
Java:
String original = new String("BANANAS");
byte[] utf8Bytes = original.getBytes("UTF8");
//save the length as a 32 bit integer, then utf8 Bytes to a file
C++:
int32_t tlength;
std::string utf8Bytes;
//load the tlength as a 32 bit integer, then the utf8 bytes from the file
//well, that's easy for UTF8
//to turn that into a UTF-16 string on Windows
int wlength = MultiByteToWideChar(CP_UTF8, 0, utf8Bytes.c_str(), utf8Bytes.size(), nullptr, 0);
std::wstring result(wlength, '\0');
MultiByteToWideChar(CP_UTF8, 0, utf8Bytes.c_str(), utf8Bytes.size(), &result[0], wlength);
//so that's not hard either
To do this in linux, one uses the iconv library, which is incredibly powerful, but more difficult to use. Here's a function that converts a std::string in UTF8 to a std::wstring in UTF32: http://coliru.stacked-crooked.com/view?id=986a4a07e391213559d4e65acaf231d5-e54ee7a04e4b807da0930236d4cc94dc
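For completeness, the Java writing step that the comments above allude to might look like this (a sketch; the file name is arbitrary, and DataOutputStream writes the length big-endian, so the C++ loader must read it in the same byte order or swap it):
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteUtf8Record {
    public static void main(String[] args) throws IOException {
        String original = "BANANAS";
        byte[] utf8Bytes = original.getBytes("UTF8");
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("string.bin"))) {
            out.writeInt(utf8Bytes.length); // 32-bit length prefix (big-endian)
            out.write(utf8Bytes);           // then the raw UTF-8 bytes
        }
    }
}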
There's no such thing as One True C++ String class. The STL alone has std::string and std::wstring. That said, most string classes have a constructor that takes a raw byte pointer as a parameter. The bytes come in const char * form, so a good example of your C++functionX is the constructor std::string::string(const char*, int).
Note the encoding issues: getBytes() takes an encoding as a parameter; you'd better match that on the C++ side, or you'll get a jumble. If not sure, use UTF-8.
Depending on what kinds of Java strings you have, you might want to choose either regular or wide strings (e.g. std::wstring). The latter is a slightly better representation of what Java String offers.
C++, as far as the standard goes, doesn't know about encodings. Java does. So, to interface the two, make Java emit some well-defined encoding, such as UTF8:
byte[] utf8str = str.getBytes("UTF8");
In C++, use a library such as iconv() to transform the UTF8-string either into another string of a well-defined encoding (e.g. std::u32string with UTF-32, if you have C++11, or std::basic_string<uint32_t> or std::vector<uint32_t> otherwise), or, alternatively, convert it to WCHAR_T encoding, to be stored in a std::wstring, and proceed further to convert this to a multi-byte string via the standard function wcstombs() if you wish to interface with your environment.
The choice depends on what you need to do with the string. For serialization or text processing, go with the definite encoding (e.g. UTF-32). For writing to the standard output using the system's locale, use the multibyte conversion. (Here is a slightly longer discussion of encodings in C++.)
The C++ string should probably be a std::wstring instance, and you would also need to keep track of the encoding you use to transform from the Java String to bytes.
This article will probably help you more:
std::wstring VS std::string

A C structure accessed in Java

I have a C structure that is sent over some intermediate networks and received over a serial link by Java code. The Java code gives me a byte array that I now want to repackage as the original structure. If the receiving code were in C, this would be simple. Is there any simple way to repackage a byte[] in Java as a C struct? I have minimal experience in Java, but this doesn't appear to be a common problem or one solved in any FAQ that I could find.
FYI the C struct is
struct data {
    uint8_t  moteID;
    uint8_t  status;    //block or not
    uint16_t tc_1;
    uint16_t tc_2;
    uint16_t panelTemp; //board temp
    uint16_t epoch#;
    uint16_t count;     //pkt seq since the start of epoch
    uint16_t TEG_v;
    int16_t  TEG_c;
} data;
I would recommend that you send the numbers across the wire in network byte order all the time. This eliminates the problems of:
Compiler specific word boundary generation for your structure.
Byte order specific to your hardware (both sending and receiving).
Also, Java's numbers are always stored in network-byte-order no matter the platform that you run Java upon (the JVM spec requires a specific byte order).
A very good class for extracting values from a byte stream is java.nio.ByteBuffer, which can wrap arbitrary byte arrays, not just those coming from an I/O class in java.nio. You really should not hand-code your own extraction of primitive values if at all possible (i.e. bit shifting and so forth), since it is easy to get this wrong, the code is the same for every instance of the same type, and there are plenty of standard classes that provide this for you.
For example:
public class Data {
    private byte moteId;
    private byte status;
    private short tc_1;
    private short tc_2;
    //...etc...
    private int tc_2_as_int;

    private Data() {
        // empty
    }

    public static Data createFromBytes(byte[] bytes) throws IOException {
        final Data data = new Data();
        final ByteBuffer buf = ByteBuffer.wrap(bytes);
        // If needed...
        //buf.order(ByteOrder.LITTLE_ENDIAN);
        data.moteId = buf.get();
        data.status = buf.get();
        data.tc_1 = buf.getShort();
        data.tc_2 = buf.getShort();
        // ...extract other fields here
        // Example to convert the unsigned 16-bit tc_2 to a positive int
        // (masking avoids sign extension; no extra bytes are consumed from the buffer)
        data.tc_2_as_int = data.tc_2 & 0xffff;
        return data;
    }
}
Now, to create one, just call Data.createFromBytes(byteArray).
Note that Java does not have unsigned integer variables, but these will be retrieved with the exact same bit pattern. So anything where the high-order bit is not set will be exactly the same when used. You will need to deal with the high-order bit if you expected that in your unsigned numbers. Sometimes this means storing the value in the next larger integer type (byte -> short; short -> int; int -> long).
Edit: Updated the example to show how to convert a short (16-bit signed) to an int (32-bit signed) with the unsigned value with tc_2_as_int.
Note also that if you cannot change the byte-order and it is not in network order, then java.nio.ByteBuffer can still serve you here with buf.order(ByteOrder.LITTLE_ENDIAN); before retrieving the values.
This can be difficult to do even when sending from C to C.
If you have a data struct, cast it so that you end up with an array of bytes/chars, and then just blindly send it, you can sometimes end up with big problems decoding it on the other end.
This is because sometimes the compiler has decided to optimize the way that the data is packed in the struct, so in raw bytes it may not look exactly how you expect it would look based on how you code it.
It really depends on the compiler!
There are compiler pragmas you can use to make the packing unoptimized. See C/C++ Preprocessor Reference - pack
The other problem is the 32/64-bit problem if you just use "int" and "long" without specifying the number of bytes... but you have done that :-)
Unfortunately, Java doesn't really have structs, but it can represent the same information in classes.
What I recommend is that you make a class that consists of your variables, and just write a custom unpacking function that pulls the bytes out of the received packet (after you have checked its correctness after transfer) and then loads them into the class.
e.g. You have a data class like
class Data
{
    public int moteID;
    public int status; //block or not
    public int tc_1;
    public int tc_2;
}
Then when you receive a byte array, you can do something like this
Data convertBytesToData(byte[] dataToConvert)
{
    Data d = new Data();
    // Mask with 0xFF so the signed bytes are not sign-extended when widened to int.
    d.moteID = dataToConvert[0] & 0xFF;
    d.status = dataToConvert[1] & 0xFF;
    d.tc_1 = ((dataToConvert[2] & 0xFF) << 8) + (dataToConvert[3] & 0xFF); // unpacking 16 bits
    d.tc_2 = ((dataToConvert[4] & 0xFF) << 8) + (dataToConvert[5] & 0xFF); // unpacking 16 bits
    return d;
}
I might have the 16-bit unpacking the wrong way around; it depends on the endianness of your C system, but you'll be able to play around and see if it's right or not.
I haven't played with Java for some time, but hopefully there are byte[]-to-int functions built in these days.
I know there are for C# anyway.
With all this in mind, if you are not doing high data rate transfers, definitely look at JSON and Protocol Buffers!
Assuming you have control over both ends of the link, rather than sending raw data you might be better off going for an encoding that C and Java can both use. Look at either JSON or Protocol Buffers.
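As a small illustration of that suggestion, here is a sketch using Gson, one of many JSON libraries (the dependency and the field values are assumptions for the example, not something the answer prescribes):
import com.google.gson.Gson;

public class JsonExample {
    // A plain value class mirroring some of the struct's fields.
    static class Data {
        int moteID;
        int status;
        int tc_1;
        int tc_2;
    }

    public static void main(String[] args) {
        Data d = new Data();
        d.moteID = 3;
        d.status = 1;
        d.tc_1 = 1200;
        d.tc_2 = 1450;
        // Serialize to a self-describing text form instead of a raw memory layout.
        String json = new Gson().toJson(d);
        System.out.println(json); // {"moteID":3,"status":1,"tc_1":1200,"tc_2":1450}
    }
}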
What you are trying to do is problematic for a couple of reasons:
Different C implementations will represent uint16_t (and int16_t) values in different ways. In some cases, the most significant byte will be first when the struct is laid out in memory. In other cases, the least significant byte will.
Different C compilers may pack the fields of the struct differently. So it is possible (for example) that the fields have been reordered or padding may have been added.
So what this all means is that you have to figure out exactly how the struct is laid out... and just hope that this doesn't change when / if you change C compilers or the C target platform.
Having said that, I could not find a Java library for decoding arbitrary binary data streams that allows you to select "endian-ness". The DataInputStream and DataOutputStream classes may be the answer, but they are explicitly defined to send/expect the high order byte first. If your data comes the other way around you will need to do some Java bit bashing to fix it.
EDIT: actually (as @Kevin Brock points out) java.nio.ByteBuffer allows you to specify the endianness when fetching various data types from a binary buffer.
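As a small illustration of that "bit bashing" alternative (a sketch with a made-up file name, using field names from the struct in the question), DataInputStream can still handle little-endian data by swapping the bytes of each value after reading it:
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class LittleEndianRead {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("packet.bin"))) {
            // readShort()/readInt() assume big-endian input; reverseBytes()
            // swaps the byte order to get the little-endian interpretation.
            int moteId = in.readUnsignedByte();
            int status = in.readUnsignedByte();
            int tc1 = Short.reverseBytes(in.readShort()) & 0xFFFF;
            int tc2 = Short.reverseBytes(in.readShort()) & 0xFFFF;
            System.out.println(moteId + " " + status + " " + tc1 + " " + tc2);
        }
    }
}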
