An object is already a sequence of bytes; that is how the machine represents all data, whether it is objects, text, images, etc. Could you clarify why we convert a sequence of bytes (an object) into another sequence of bytes? Do we restructure the bytes when we serialize, or do we wrap the object in a template that gives it a special meaning when transmitted over the network? Suppose some method took the object from memory exactly as it is, put it into IP datagrams, and sent it over the network: what issues would arise?
First: compression.
You must understand that an image file on disk and an image rendered in memory are not the same thing. On disk it is (usually, forget about BMP) compressed. Given current network throughput and disk capacities, compression is essential.
Second: architecture.
A number in memory is just a sequence of bits, yes. But how many bits count as one number? 8? 16? 32? 64? Any of them. There are bytes, words, integers, longs, floats (hell, floats!) and a couple of dozen more. Byte order also matters, the so-called big-endian and little-endian layouts. So the bytes representing 123456789 on one machine (a little-endian x86, say) do not mean the same number on a big-endian machine.
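For illustration, here is a minimal Java sketch (using java.nio.ByteBuffer, which lets you choose the byte order explicitly) showing that the same int turns into two different byte sequences depending on endianness:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EndianDemo {
    public static void main(String[] args) {
        int value = 123456789;
        // Same number, two different byte layouts.
        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
        System.out.println(Arrays.toString(big));    // [7, 91, -51, 21]
        System.out.println(Arrays.toString(little)); // [21, -51, 91, 7]
    }
}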
Third: file (read: transmission) format != object-in-memory format.
Well, there is a difference between the data structure in a file (or a network packet) and the object loaded from that file into memory. Additionally, the in-memory structure of an object can differ from one program version to the next: an image loaded into memory under Win 3.1 and, for example, under Vista is a hell of a big difference. There is also structure packing, alignment to word boundaries (8, 16, 32, 64 bits), and so on.
The object itself includes many references, which are pointers to where another component of the object happens to exist in memory on this particular machine at this particular moment. The point of serialization is that it converts objects into bytes that can be read at some other time, possibly on some other machine.
Additionally, object representations in memory are optimized for fast access and modification, not necessarily taking the minimum number of bytes. Some serialization protocols, especially for use in RPCs or data storage, optimize for how many bytes have to be transmitted or stored using compression algorithms that make it more difficult to access or modify the properties of the object in exchange for using fewer bytes to do it.
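As a rough illustration of the idea, here is a minimal sketch using Java's built-in serialization; the Person class and field names are made up purely for the example:

import java.io.*;

// A made-up class for illustration: it holds a reference to another object, not raw data.
class Person implements Serializable {
    String name;
    Person friend;
    Person(String name) { this.name = name; }
}

public class SerializeDemo {
    public static void main(String[] args) throws Exception {
        Person alice = new Person("Alice");
        alice.friend = new Person("Bob");

        // Serialize: the in-memory pointers are replaced by a flat byte encoding.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(alice);
        }
        byte[] bytes = buf.toByteArray();   // safe to store or send over the network

        // Deserialize: new objects are rebuilt, at new memory addresses.
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Person copy = (Person) in.readObject();
            System.out.println(copy.friend.name);   // Bob
        }
    }
}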
The object itself is a sequence of bytes
No. The object itself isn't just a 'sequence of bytes', unless it contains nothing but primitive data. It can contain
references to other objects
those objects may already have been serialized, in which case a back-reference needs to be serialized, not the referenced object all over again
those references may be null
there may be no object at all, just primitive data
All these things increase the complexity of the task well beyond the naive notion of just serializing 'a sequence of bytes'.
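One hedged way to see the back-reference behaviour is to serialize the same object twice and compare stream sizes: Java's ObjectOutputStream keeps a handle table keyed by object identity, so a second occurrence is written as a small back-reference rather than a full copy. A rough sketch (the class and variable names are made up):

import java.io.*;

public class BackRefDemo {
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String big = new String(new char[10_000]).replace('\0', 'x');

        // Two references to the SAME object: written once, the second is a back-reference.
        Object[] shared = { big, big };
        // Two distinct but equal objects: written out twice in full.
        Object[] copies = { big, new String(big) };

        System.out.println(serialize(shared).length);  // roughly 10 KB
        System.out.println(serialize(copies).length);  // roughly 20 KB
    }
}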
For example, if a file is 100 bits, it would be stored as 13 bytes. This means that the first 4 bits of the last byte belong to the file and the last 4 bits do not (they are useless padding).
So how is this prevented when reading a file using the FileInputStream.read() function in Java, or similar functions in other programming languages?
You'll notice, if you ever use assembly, that there is no way to read a specific bit. The smallest addressable unit of memory is a byte; memory addresses refer to a specific byte in memory. To work with an individual bit you have to use bitwise operators like |, & and ^. So in this situation, if you store 100 bits, you're actually storing a minimum of 13 bytes, and the few leftover bits simply default to 0, so the result is the same.
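For example (a minimal sketch, assuming you just want one bit out of the first byte of a file), the byte is read whole and the bit is isolated with a shift and a mask:

import java.io.FileInputStream;
import java.io.IOException;

public class BitRead {
    // Returns bit number `bitIndex` (0 = most significant) of the first byte of a file.
    static int firstBit(String path, int bitIndex) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            int b = in.read();                      // reads a whole byte; -1 means end of file
            if (b < 0) throw new IOException("empty file");
            return (b >> (7 - bitIndex)) & 1;       // shift + mask to isolate one bit
        }
    }
}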
Current file systems mostly store files that are an integral number of bytes, so the issue does not arise. You cannot write a file that is exactly 100 bits long. The reason for this is simple: the file metadata holds the length in bytes and not the length in bits.
This is a conscious design choice by the designers of the file system. They presumably designed it this way because there is very little need for files that are an arbitrary number of bits long.
Those cases that do need a file to contain a non-integral number of bytes can (and need to) make their own arrangements. Perhaps the 100-bit case could insert a header that says, in effect, that only the first 100 bits of the following 13 bytes have useful data. This would of course need special handling, either in the application or in some library that handled that sort of file data.
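A minimal sketch of that header idea, assuming a home-grown format of a 4-byte bit count followed by the byte-padded payload (the format and method names are made up for illustration):

import java.io.*;

public class BitLengthFile {
    // Writes `bitCount` bits (taken from the start of `data`) behind a 4-byte length header.
    static void write(String path, byte[] data, int bitCount) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            out.writeInt(bitCount);                    // header: how many bits are meaningful
            out.write(data, 0, (bitCount + 7) / 8);    // payload padded up to whole bytes
        }
    }

    // Reads back how many bits of the payload are valid.
    static int readBitCount(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            return in.readInt();
        }
    }
}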
Comments about bit-length files not being possible because of the size of a boolean, etc., seem to me to miss the point. Certainly disk storage granularity is not the issue: we can store a "100 byte" file on a device that can only handle units of 256 bytes - all it takes is for the file system to note that the file size is 100, not 256, even though 256 bytes are allocated to the file. It could equally well track that the size was 100 bits, if that were useful. And, of course, we'd need I/O syscalls that expressed the transfer length in bits. But that's not hard. The in-memory buffer would need to be slightly larger, because neither the language nor the OS allocates RAM in arbitrary bit lengths, but that's not tied tightly to file size.
I'm writing a protocol that uses a single bit to represent a boolean, but Java's networking APIs don't use anything smaller than a byte. I want to know why network data is designed around bytes rather than bits. Wouldn't exposing the smaller unit be better?
Because the fundamental packets of IP are defined in terms of bytes rather than bits. All packets are a whole number of bytes rather than bits. Even bit fields are part of a multi-byte field. Bytes, rather than bits, are fundamental to IP.
This is ultimately because computer memory does not have addressable bits; it has addressable bytes. Any implementation of a networking protocol has to be based on bytes, not bits, and that is also why Java does not provide direct access to bits.
The network bandwidth that could be saved by carrying single bits of payload is simply not worth the added complexity at both the hardware and software level.
Fundamentally, both at the hardware level (registers) and the software level, the minimal unit of data handling is the byte, 8 bits (or octet, if you want to be nitpicky), or a multiple of it. You cannot address memory at the bit level, only at the byte level. Doing otherwise would be very complicated, down to the silicon, with no added value.
Whatever the programming language, when you declare and use a boolean, a byte (or a power-of-two multiple of bytes, why not, as long as it can be loaded from memory into a CPU register) will actually be used to store it, and the language only cares about two cases when using it: is this byte all 0 bits, or not? At the machine code/assembly level: load this byte from its memory address into register FOO (or several bytes, if the register is, say, 32 bits wide), cmp FOO to 0, and depending on the result JE (Jump if Equal) to code address BAR, or else continue with the next instruction. Or JNE (Jump if Not Equal) to some other code address. So your Java boolean is not actually stored as a bit. It is, at minimum, a byte.
Even the good old Ethernet frame, before you even look at the actual payload, starts with a 56-bit preamble to synchronize devices. 56 bits is 7 bytes. Could the synchronization be done with fewer bits, not a whole number of bytes? Maybe, but it's not worth the effort.
https://en.wikipedia.org/wiki/Ethernet_frame#Preamble_and_start_frame_delimiter
Pedantic edit for nitpickers:
A language such as C has a bit field facility:
https://en.wikipedia.org/wiki/Bit_field
...but don't be fooled: the minimal storage unit at the silicon level for a bit from a bit field is still a byte. Hence the "field" in "bit fields".
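For example, the Java-level equivalent is packing several booleans into one byte by hand with masks and shifts; the storage unit is still a byte, only the interpretation is per-bit. A minimal sketch (class name is made up):

public class BitFlags {
    private byte flags;   // eight booleans share this single byte

    void set(int i, boolean value) {
        if (value) flags |= (byte) (1 << i);    // turn bit i on
        else       flags &= (byte) ~(1 << i);   // turn bit i off
    }

    boolean get(int i) {
        return (flags & (1 << i)) != 0;         // test bit i
    }
}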
From memory, data may only be read in the natural word size of the architecture. For example, on a 32-bit system, data is read from memory in 4-byte chunks. If a 2-byte or 1-byte value is stored in memory, reading it still requires accessing a 4-byte word. (In the case of the 2-byte value, two 4-byte accesses might be required if the value straddles a word boundary.)
Therefore, an access to an individual value is the fastest when it requires accessing a single word, and minimal additional work (such as masking) is required. If I'm correct, this is the reason virtual machines (such as JVM or Android's Dalvik) lay out member variables in 4-byte boundaries in Object instances.
Another concept is cache friendliness, i.e. locality (e.g. L1, L2). If many values must be traversed/processed directly after each other, it is beneficial that they are stored close to each other (ideally, in a contiguous block). This is spatial locality. If this isn't possible, at least the operations on the same value should be done in the same time period (temporal locality -- i.e. it has a high chance that the value is retained in cache while the operations are performed on it).
As far as I can see, the above two concepts can be "contradictory" in some cases, and the choice between them depends on their usage scenario. For example, a smaller amount of contiguous data is more cache friendly than a greater amount (trivial), yet if random access is commonly required on some data, the word-aligned (but greater-sized) structure might be beneficial -- unless the whole structure fits in the cache. Therefore, whether locality (~arrays) or alignment benefits should be preferred depends on how the values will be manipulated, I think.
There is a scenario which is interesting for me: let's assume a pathfinding algorithm which receives the input graph (and other auxiliary structures) as arrays. (Most of its input arrays store values that are <= 32767.)
The pathfinding algorithm performs very many random accesses on the arrays (in several loops). In this sense, an int[] might be desired for input data (on Android/ARM), because the values will be on word boundary when accessed. (On the other hand, if sequential traversals were needed, then a smaller datatype would be recommended -- especially for large arrays -- because of a higher probability of cache-friendliness.)
However, what if the (randomly accessed) input data would fit in L1/L2 as a short[], but not as an int[]? In such a case, would the advantage of 4-byte alignment of int[] for random access be outweighed by the cache-friendliness of short[]?
In a concrete application, of course, I'd make measurements for comparison. That wouldn't necessarily answer the above questions, however.
If you can ensure that moving to short leads to significantly better locality (i.e. everything fits in cache), this outweighs the alignment penalties.
Access to cache is in the low nanoseconds (<10 ns); access to RAM is 60-80 ns.
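If you want to measure it, a rough sketch of the comparison might look like the following (class name and sizes are made up, and this is not a rigorous benchmark; JMH on the actual ARM device would give far more trustworthy numbers):

import java.util.Random;

public class ArrayAccessBench {
    public static void main(String[] args) {
        int n = 4_000_000;                 // large enough that int[] may spill out of L2
        int[] asInt = new int[n];
        short[] asShort = new short[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            short v = (short) rnd.nextInt(32768);
            asInt[i] = v;
            asShort[i] = v;
        }
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = rnd.nextInt(n);   // random access pattern

        long sum = 0, t0 = System.nanoTime();
        for (int i : order) sum += asInt[i];
        long tInt = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i : order) sum += asShort[i];
        long tShort = System.nanoTime() - t0;

        System.out.println(sum + " int[]: " + tInt + " ns, short[]: " + tShort + " ns");
    }
}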
I'm looking for a way to encode a sequence of enum values in Java that packs better than one object reference per element. In fantasy-code:
List<MyEnum> list = new EnumList<MyEnum>(MyEnum.class);
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits per element. Is there an existing implementation for this, or a simple way to do it?
It would be sufficient to have a class that encodes a sequence of numbers of arbitrary radix (i.e. if there are 5 possible enum values then use base 5) into a sequence of bytes, since a simple wrapper class could be used to implement List<MyEnum>.
I would prefer a general, existing solution, but as a poor man's solution I might just use an array of longs and radix-encode as many elements as possible into each long. With 5 enum values, 27 elements will fit into a long and waste only ~1.3 bits, which is pretty good.
Note: I'm not looking for a set implementation. That wouldn't preserve the sequence.
You can store bits in an int (32 bits, 32 "switches"). But aside from the exercise value, what's the point? You're really talking about a very small amount of memory. A better question might be: why do you want to save a few bytes on enum references? Other parts of your program are likely to be using much more memory.
If you're concerned with transferring data efficiently, you could consider leaving the Enums alone but using custom serialization, though again, it'd be an unusual situation where it'd be worth the effort.
One object reference typically occupies one 32-bit or 64-bit word. To do better than that, you need to convert the enum values into numbers that are smaller than 32 bits, and hold them in an array.
Converting to a number is as simple as calling ordinal(). From there you could:
cast to a byte or short, then represent the sequence as an array of byte / short values, or
use a suitable compression algorithm on the array of int values.
Of course, all of this comes at the cost of making your code more complicated. For instance you cannot make use of the collection APIs, and you have to do your own sequence management. I doubt that this will be worth it unless you have to deal with very large sequences or huge numbers of sequences.
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits.
In fact you may be able to do better than that ... by compressing the sequences. It depends on how much redundancy there is.
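As a rough sketch of the radix-packing idea from the question (this is not an existing library class, just an illustration; a real implementation might instead use fixed-width bit packing with ceil(log2(radix)) bits per element):

import java.util.List;

// Packs enum ordinals into longs using radix encoding, so e.g. 27 values
// of a 5-constant enum share one long. Illustrative sketch only.
class PackedEnumSequence<E extends Enum<E>> {
    private final E[] constants;
    private final int radix;
    private final int perLong;      // how many elements fit into one long
    private final long[] words;

    PackedEnumSequence(Class<E> type, List<E> values) {
        constants = type.getEnumConstants();
        radix = Math.max(2, constants.length);   // guard against a degenerate 1-constant enum
        int fit = 0;
        for (long cap = 1; cap <= Long.MAX_VALUE / radix; cap *= radix) fit++;
        perLong = fit;                           // 27 for radix 5
        words = new long[(values.size() + perLong - 1) / perLong];
        for (int i = 0; i < values.size(); i++) {
            words[i / perLong] += values.get(i).ordinal() * pow(radix, i % perLong);
        }
    }

    E get(int i) {
        long digit = (words[i / perLong] / pow(radix, i % perLong)) % radix;
        return constants[(int) digit];
    }

    private static long pow(long base, int exp) {
        long r = 1;
        for (int k = 0; k < exp; k++) r *= base;
        return r;
    }
}

A thin wrapper implementing List<MyEnum> on top of something like this would then give the fantasy EnumList from the question.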
I have been programming in Java since 2004, mostly enterprise and web applications. But I have never used short or byte, other than in a toy program just to see how these types work. Even in a for loop of 100 iterations, we usually go with int. And I don't remember ever having come across any code that made use of byte or short, other than some public APIs and frameworks.
Yes, I know you can use a short or byte to save memory in large arrays, in situations where the memory savings actually matter. Does anyone practice that? Or is it just something in the books?
[Edited]
Using byte arrays for network programming and socket communication is quite common. Thanks, Darren, for pointing that out. Now how about short? Ryan gave an excellent example. Thanks, Ryan.
I use byte a lot. Usually in the form of byte arrays or ByteBuffer, for network communications of binary data.
I rarely use float or double, and I don't think I've ever used short.
Keep in mind that Java is also used on mobile devices, where memory is much more limited.
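For example, a typical wire message might be assembled like this with ByteBuffer; the 2-byte type / 4-byte sequence layout is purely illustrative, not any particular protocol:

import java.nio.ByteBuffer;

public class MessageCodec {
    // Hypothetical layout: 2-byte message type, 4-byte sequence number, then the payload bytes.
    static byte[] encode(short type, int seq, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + payload.length);
        buf.putShort(type);
        buf.putInt(seq);
        buf.put(payload);
        return buf.array();
    }
}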
I used 'byte' a lot, in C/C++ code implementing functionality like image compression (i.e. running a compression algorithm over each byte of a black-and-white bitmap), and processing binary network messages (by interpreting the bytes in the message).
However I have virtually never used 'float' or 'double'.
The primary usage I've seen for them is while processing data with an unknown structure or even no real structure. Network programming is an example of the former (whoever is sending the data knows what it means but you might not), something like image compression of 256-color (or grayscale) images is an example of the latter.
Off the top of my head grep comes to mind as another use, as does any sort of file copy. (Sure, the OS will do it--but sometimes that's not good enough.)
The Java language itself makes it unreasonably difficult to use the byte or short types. Whenever you perform any operation on a byte or short value, Java promotes it to an int first, and the result of the operation is returned as an int. Also, they're signed, and there are no unsigned equivalents, which is another frequent source of frustration.
So you end up using byte a lot because it's still the basic building block of all things cyber, but the short type might as well not exist.
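A small illustration of those promotion rules (nothing exotic here, it is just how the language specification works):

public class PromotionDemo {
    public static void main(String[] args) {
        byte a = 10, b = 20;
        // byte sum = a + b;            // does not compile: a and b are promoted to int
        byte sum = (byte) (a + b);      // an explicit cast is needed after every operation

        short s = 1000;
        // s = s + 1;                   // does not compile either, same promotion rule
        s += 1;                         // compound assignment hides the narrowing cast

        System.out.println(sum + " " + s);
    }
}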
Until today I hadn't noticed how seldom I use them.
I've used byte for network-related stuff, but most of the time it was for my own tools/learning. In work projects these things are handled by frameworks (JSP, for instance).
Short? Almost never.
Long? Neither.
My preferred integer type for literals is always int: for loops, counters, etc.
When data comes from somewhere else (a database, for instance) I use the proper type, but for literals I always use int.
I use bytes in lots of different places, mostly involving low-level data processing. Unfortunately, the designers of the Java language made bytes signed. I can't think of any situation in which having negative byte values has been useful. Having a 0-255 range would have been much more helpful.
I don't think I've ever used shorts in any proper code. I also never use floats (if I need floating point values, I always use double).
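For what it's worth, the usual workarounds for the signed-byte annoyance look like this (Byte.toUnsignedInt is available since Java 8):

public class UnsignedByteDemo {
    public static void main(String[] args) {
        byte raw = (byte) 0xF0;                    // reads back as -16, not 240

        int classic = raw & 0xFF;                  // the long-standing masking idiom
        int modern  = Byte.toUnsignedInt(raw);     // Java 8+ helper, same result

        System.out.println(raw + " " + classic + " " + modern);   // -16 240 240
    }
}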
I agree with Tom. Ideally, in high-level languages we shouldn't be concerned with the underlying machine representations. We should be able to define our own ranges or use arbitrary precision numbers.
When we are programming for electronic devices like mobile phones, we use byte and short. In that case we have to pay attention to memory management.
It's perhaps more interesting to look at the semantics of int. Are those arbitrary limits and silent truncation really what you want? Application-level code really wants arbitrary-sized integers; it's just that Java has no reasonable way of expressing them.
I have used bytes when saving State while doing model checking. In that application the space savings are worth the extra work. Otherwise I never use them.
I found I was using byte variables when doing some low-level image processing. The .Net GDI+ draw routines were really slow so I hand-rolled my own.
Most times, though, I stick with signed integers unless I am forced to use something larger, given the problem constraints. Any sort of physics modeling I do usually requires floats or doubles, even if I don't need the precision.
Apache POI used short in quite a few places, probably because of Excel's row/column number limits.
A few months ago they changed to int, replacing
createCell(short columnIndex)
with
createCell(int column).
In in-memory data grids, it can be useful.
The concept of a data grid like Gemfire is to have a huge distributed map.
When you don't have enough memory you can overflow to disk with an LRU strategy, but the keys of all the entries of your map remain in memory (at least with Gemfire).
Thus it is very important to give your keys a small footprint, particularly if you are handling very large datasets.
For the entry values, when you can, it's also better to use appropriate types with a small memory footprint.
I have used shorts and bytes in Java apps communicating with custom USB or serial micro-controllers, receiving 10-bit values wrapped in two bytes as shorts.
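A minimal sketch of that kind of decoding, assuming the low byte arrives first and the top two bits ride in the second byte (the actual wire order of course depends on the micro-controller):

public class TenBitDecode {
    // Combines two received bytes into a 10-bit value (0..1023).
    static short decode(byte low, byte high) {
        return (short) (((high & 0x03) << 8) | (low & 0xFF));
    }
}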
bytes and shorts are extensively used in Java Card development. Take a look at my answer to Are there any real life uses for the Java byte primitive type?.