I have a Java application that persists byte[] structures to a DB (using Hibernate). I'm writing a C++ application that reads these structures at a later time.
As you might expect, I'm having problems... The byte[] structure written to the DB is longer than the original number of bytes - 27 bytes longer, by the looks of things.
Is there a way of determining (in C++) the byte[] header structure to properly find the beginning of the true data?
I spent some time looking at the JNI source code (GetArrayLength(jbytearray), etc.) to determine how that works, but got quickly mired in the vagaries of JVM code. yuck...
Ideas?
The object is probably being serialized using the Java Object Serialization Protocol. You can verify this by looking for the magic number 0xACED at the beginning. If this is the case, it's just wrapped with some meta information about the class and length, and you can easily parse the actual byte values off the end.
In particular, you would see 0xAC 0xED 0x00 0x05 for the header, followed by a classDesc element that would be 0x75 ...bytes... 0x70, followed by a 4-byte length, and then the bytes themselves. Java serializes the length and other multibyte values in big-endian format.
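If you want to confirm this from the Java side first, here is a minimal sketch (the class name is mine) that serializes a byte[] with ObjectOutputStream and hex-dumps the result; the 27 bytes of wrapper described above show up before the payload:

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

public class SerializedByteArrayDump {
    public static void main(String[] args) throws Exception {
        byte[] payload = {1, 2, 3, 4};

        // Wrap the array the same way default Java serialization would.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(payload);
        }
        byte[] serialized = bos.toByteArray();

        System.out.println("payload length:    " + payload.length);    // 4
        System.out.println("serialized length: " + serialized.length); // 31 = 27-byte wrapper + 4 data bytes

        // Hex dump: 0xAC 0xED 0x00 0x05, 0x75 (array tag), the class descriptor for [B,
        // 0x70 (null superclass), a 4-byte big-endian length, then the raw bytes.
        StringBuilder hex = new StringBuilder();
        for (byte b : serialized) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex);
    }
}

For a plain byte[] written as the first object in the stream, that wrapper is a fixed size, so a C++ reader can skip it, or scan forward for the 0x70 marker and read the big-endian length that follows it.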
Related
I am trying to understand ByteBuffer.wrap(byte[]), or ByteBuffer in general:
If I have a byte array which contains values of various lengths and different types (for example int16s, int32s and UTF-16 strings, all in LITTLE ENDIAN byte order, plus some ASCII strings as well) and then wrap it with a ByteBuffer and send it across the network, let's say via an AsynchronousSocketChannel, in which order are my bytes sent?
Does it send them in BIG ENDIAN? Does it treat the byte array as one big value and change its order to big endian, or does it preserve the existing byte order and only add new elements in big-endian byte order?
The background is that I am dealing with a client that sends and receives bytes in little endian order and it seems that it can't deal with the data which I send across the network.
If the data is in little endian within the wrapped buffer then it remains in little-endian order. If you add integer values then it depends on the order of the buffer, which defaults to big endian, or "network order".
The byte order of the buffer instance only matters when primitive values are read or written using the various get and put methods, such as getInt and putInt (with or without a position argument). The buffer and the data already stored within it stay untouched if the byte order is changed.
Basically, when retrieving data from the buffer, the byte with the lowest index is retrieved (and sent) first, then the next one, and so on. Commonly that's thought of as the leftmost byte. The position, in other words, always goes up while bytes are retrieved from the buffer and sent, until the limit is reached.
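A minimal sketch illustrating both points: bytes already in a wrapped array are sent exactly as they lie, and the buffer's configured order only affects the multi-byte put/get calls:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    public static void main(String[] args) {
        // Wrapping never reorders anything: the channel just walks the buffer
        // from position to limit, lowest index first.
        byte[] littleEndianPayload = {0x01, 0x00};  // the value 1 as a little-endian int16
        ByteBuffer wrapped = ByteBuffer.wrap(littleEndianPayload);
        System.out.println("bytes to send: " + wrapped.remaining()); // 2, sent as 0x01 then 0x00

        // The order setting only matters for putShort/putInt/getInt, etc.
        ByteBuffer buf = ByteBuffer.allocate(4);
        buf.order(ByteOrder.LITTLE_ENDIAN);  // default would be BIG_ENDIAN ("network order")
        buf.putShort((short) 1);             // writes 0x01 0x00
        buf.order(ByteOrder.BIG_ENDIAN);
        buf.putShort((short) 1);             // writes 0x00 0x01; earlier bytes are left untouched

        buf.flip();
        while (buf.hasRemaining()) {
            System.out.printf("%02X ", buf.get());  // prints: 01 00 00 01
        }
    }
}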
For example, if a file is 100 bits, it would be stored as 13 bytes. This means that the first 4 bits of the last byte are part of the file and the last 4 bits are not (useless padding data).
So how is this prevented when reading a file using the FileInputStream.read() function in Java, or similar functions in other programming languages?
You'll notice if you ever use assembly that there's no way to actually read a specific bit. The smallest addressable unit of memory is a byte; memory addresses refer to a specific byte in memory. If you ever need a specific bit, you have to read the byte containing it and use bitwise operators like |, & and ^ to pick the bit out. So in this situation, if you store 100 bits, you're actually storing a minimum of 13 bytes, and the few leftover bits just default to 0, so the results are the same.
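A small Java sketch of the masking and shifting this refers to (the bit positions chosen here are just for illustration):

public class BitAccess {
    public static void main(String[] args) {
        byte packed = (byte) 0b1011_0010;

        // There is no load-a-single-bit instruction: read the whole byte,
        // then shift and mask to isolate the bit you care about.
        int bit5 = (packed >> 5) & 1;           // bit 5, counting from 0 at the least significant end
        System.out.println("bit 5 = " + bit5);  // 1

        // Setting and clearing individual bits uses | and & in the same way.
        byte withBit0Set   = (byte) (packed | 0b0000_0001);  // 0xB3
        byte withBit7Clear = (byte) (packed & 0b0111_1111);  // 0x32
        System.out.printf("%02X %02X%n", withBit0Set, withBit7Clear);
    }
}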
Current file systems mostly store files that are an integral number of bytes, so the issue does not arise. You cannot write a file that is exactly 100 bits long. The reason for this is simple: the file metadata holds the length in bytes and not the length in bits.
This is a conscious design choice by the designers of the file system. They presumably chose the design the way they did out of a consideration that there's very little need for files that are an arbitrary number of bits long.
Those cases that do need a file to contain a non-integral number of bytes can (and need to) make their own arrangements. Perhaps the 100-bit case could insert a header that says, in effect, that only the first 100 bits of the following 13 bytes have useful data. This would of course need special handling, either in the application or in some library that handled that sort of file data.
Comments about bit-length files not being possible because of the size of a boolean, etc., seem to me to miss the point. Certainly disk storage granularity is not the issue: we can store a "100 byte" file on a device that can only handle units of 256 bytes - all it takes is for the file system to note that the file size is 100, not 256, even though 256 bytes are allocated to the file. It could equally well track that the size was 100 bits, if that were useful. And, of course, we'd need I/O syscalls that expressed the transfer length in bits. But that's not hard. The in-memory buffer would need to be slightly larger, because neither the language nor the OS allocates RAM in arbitrary bit lengths, but that's not tied tightly to file size.
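As a purely illustrative sketch of the header idea mentioned above (the layout and class are made up, and bits are assumed to be packed most-significant-bit first): a 4-byte bit count is written before the padded data bytes, and the reader masks off the padding.

import java.io.*;

// A hypothetical "bit-length-aware" layout: a 4-byte bit count, then ceil(n/8) data bytes.
public class BitLengthFile {
    static void write(File f, byte[] data, int bitCount) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(bitCount);                  // header: how many bits are actually meaningful
            out.write(data, 0, (bitCount + 7) / 8);  // data padded up to a whole number of bytes
        }
    }

    static byte[] read(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            int bitCount = in.readInt();
            byte[] data = new byte[(bitCount + 7) / 8];
            in.readFully(data);
            int padding = data.length * 8 - bitCount;
            if (padding > 0) {
                // Zero the trailing padding bits so the caller never sees the "useless data".
                data[data.length - 1] &= (byte) (0xFF << padding);
            }
            return data;
        }
    }
}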
The object itself is a sequence of bytes, and that is how the machine understands all data, whether it's an object, text, images, etc. Could you clarify this for me: why are we converting a sequence of bytes (the object) into another sequence of bytes? Do we restructure the bytes when we serialize, or make a template that holds the object to give it a special meaning when it is transmitted over the network? Suppose some method took the object from memory as-is, put it into IP datagrams and sent it through the network; what issues might arise?
First: compression.
You must understand that an image file on disk and an image rendered in memory are not the same. On disk they are (usually; forget about BMP) compressed. With current network throughput and HDD capacities, compression is essential.
Second: architecture.
A number in memory is just a sequence of bits, yes. But how many bits count as one number? 8? 16? 32? 64? Any of them. There are bytes, words, integers, longs, floats (hell, floats!) and another couple of dozen variants. And byte order also matters: so-called big-endian and little-endian. So the bytes of 123456789 on a little-endian machine (x86, for example) are not the same as on a big-endian machine (an older PowerPC or SPARC, for example).
Third: file (read: transmission) format != object-in-memory format.
Well, there is a difference between the data structure in a file (or network packet) and the same object once it has been loaded from that file into memory. Additionally, the in-memory structure of an object can differ from program version to version. An image loaded into memory under Windows 3.1 and, e.g., Vista is a hugely different thing. Also there is structure packing and 4-, 8-, 16-, 32-bit boundary alignment, etc., etc., etc.
The object itself includes many references, which are pointers to where another component of the object happens to exist in memory on this particular machine at this particular moment. The point of serialization is that it converts objects into bytes that can be read at some other time, possibly on some other machine.
Additionally, object representations in memory are optimized for fast access and modification, not necessarily taking the minimum number of bytes. Some serialization protocols, especially for use in RPCs or data storage, optimize for how many bytes have to be transmitted or stored using compression algorithms that make it more difficult to access or modify the properties of the object in exchange for using fewer bytes to do it.
The object itself is a sequence of bytes
No. The object itself isn't just a 'sequence of bytes', unless it contains nothing but primitive data. It can contain
references to other objects
those objects may already have been serialized, in which case a back-reference needs to be serialized, not the referenced object all over again
those references may be null
there may be no object at all, just primitive data
All these things increase the complexity of the task well beyond the naive notion of just serializing 'a sequence of bytes'.
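A small sketch (class names are mine) showing the back-reference behaviour mentioned above with standard Java serialization: a second reference to the same object is written as a handle rather than a second copy, and object identity survives the round trip:

import java.io.*;

public class SharedReferenceDemo {
    static class Node implements Serializable {
        int[] payload = new int[1000];  // ~4 KB of primitive data
    }
    static class Holder implements Serializable {
        Node a;
        Node b;                         // may be null, or may point at the same Node as 'a'
    }

    public static void main(String[] args) throws Exception {
        Node shared = new Node();
        Holder h = new Holder();
        h.a = shared;
        h.b = shared;                   // a second reference to the very same object

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(h);
        }
        System.out.println("serialized size: " + bos.size());  // roughly 4 KB, not 8 KB: 'b' becomes a back-reference

        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            Holder copy = (Holder) ois.readObject();
            System.out.println("same object after round trip? " + (copy.a == copy.b));  // true
        }
    }
}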
I want to insert and select images from SQL Server in JDBC. I am confused about whether a BLOB and a byte array are the same thing or different. I have used Blob in my code and the application loads slowly as it has to select the images stored as Blobs and convert them pixel by pixel. I want to use a byte array but I don't know whether they are the same or different. My main aim is to load the image faster.
Thank you
Before going further, we may need to recall the basic concepts of bit, byte, binary and BLOB.
Bit: Abbreviation of binary digit. It is the smallest storage unit. Bits can take values of 0 or 1.
Byte: The second-smallest storage unit in common use (the nibble is not mentioned since it is not a very common term). It consists of eight bits.
Binary: Actually, it is a numbering scheme in which each digit of a number can take a value of 0 or 1.
BLOB: A set of binary data stored in a database; also the type of a column which stores binary data.
To sum up the definitions: a binary format is a scheme made up of bits.
To make it more concrete, we can observe results with the code below.
public class TestByteAndBinary {
    public static void main(String[] args) {
        String s = "test"; // a string: a series of chars
        System.out.println(s);
        System.out.println();

        byte[] bytes = s.getBytes(); // for an ASCII string each char encodes to 1 byte, so the array has 4 elements
        for (byte b : bytes) {
            System.out.println(b);
        }
        System.out.println();

        for (byte b : bytes) {
            String c = String.format("%8s", Integer.toBinaryString(b)).replace(' ', '0'); // each element printed in its binary form
            System.out.println(c);
        }
    }
}
Output:
$javac TestByteAndBinary.java
$java -Xmx128M -Xms16M TestByteAndBinary
test
116
101
115
116
01110100
01100101
01110011
01110100
Let's go back to the question:
If you really want to store an image inside a database, you have to use the BLOB type.
BUT! It is not the best practice.
Because databases are designed to store data and filesystems are designed to store files.
Reading an image from disk is a simple thing, but reading an image from the database takes more time to accomplish (querying the data, transforming it to an array and vice versa).
While an image is being read, the database suffers lower performance, since it is not a simple textual or numerical read.
An image file doesn't benefit from the characteristic features of a database (like indexing).
At this point, it is best practice to store the image on a server (or file system) and store its path in the database.
In the enterprise-level projects I have seen, images are very, very rarely stored inside the database, and when they were, it was because they needed to be stored encrypted since they contained very sensitive data. In my humble opinion, even in that situation the data should not have been stored in a database.
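If you do go the BLOB route anyway, there is no separate "byte array vs. BLOB" choice on the JDBC side: you hand the driver bytes (or a stream), and the binary column type (VARBINARY(MAX) on SQL Server) stores them. A hedged sketch, with a made-up table and column names:

import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ImageBlobSketch {
    // Assumed table: CREATE TABLE images (id INT PRIMARY KEY, data VARBINARY(MAX))
    static void insertImage(Connection conn, int id, String path) throws Exception {
        try (InputStream in = new FileInputStream(path);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO images (id, data) VALUES (?, ?)")) {
            ps.setInt(1, id);
            ps.setBinaryStream(2, in);  // streams the file's bytes into the binary column
            ps.executeUpdate();
        }
    }

    static byte[] loadImage(Connection conn, int id) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT data FROM images WHERE id = ?")) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getBytes(1) : null;  // the whole image as one byte[]
            }
        }
    }
}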
BLOB simply means Binary Large Object, and it is the way a database stores a byte array.
Hope this is simple and answers your question.
I'm currently creating a web application that requires passwords to be encrypted and stored in a database. I found the following guide that encrypts passwords using PBKDF2WithHmacSHA1.
In the example provided the getEncryptedPassword method returns a byte array.
Are there any advantages in doing Base64 encoding the result?
Any Disadvantages?
The byte[] array is the smallest mechanism for storing the value (storage-space wise). If you have lots of these values it may make sense to store them as bytes. Depending on where you store the result, the format may make a difference too. Most databases will accommodate byte[] values fairly well, but it can be cumbersome (depending on the database). Stores like text files, XML documents, etc. will obviously struggle with a byte[] array.
In most circumstances I feel there are two formats that make sense: hexadecimal representation, or the raw byte[]. I seldom think the advantages of Base64 are worth it for short values (less than 32 characters); for larger items, sure, use Base64, and there's a fantastic library for it too.
This is obviously all subjective.....
Converting values to hexadecimal is quite easy: see How to convert a byte array to a hex string in Java?
Hex output is convenient and easier to manage than Base64, which has a more complicated encoding algorithm and is thus slightly slower.
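For a sense of the size difference, a small sketch (the random bytes just stand in for a PBKDF2 hash) encoding the same 20-byte value both ways with the standard library:

import java.security.SecureRandom;
import java.util.Base64;

public class HashEncodingDemo {
    public static void main(String[] args) {
        byte[] hash = new byte[20];  // stand-in for a 160-bit PBKDF2/HmacSHA1 hash
        new SecureRandom().nextBytes(hash);

        // Hexadecimal: 2 characters per byte -> 40 characters.
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }

        // Base64: 4 characters per 3 bytes -> 28 characters including padding.
        String base64 = Base64.getEncoder().encodeToString(hash);

        System.out.println("hex    (" + hex.length() + " chars): " + hex);
        System.out.println("base64 (" + base64.length() + " chars): " + base64);
    }
}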
Assuming a reasonable database there is no advantage, since it's just an encoding scheme. There is a size increase as a consequence of base 64 encoding, which is a disadvantage. If your database reliably stores 8-bit bytes, just store the hash in the database.