I've been reading the book TCP/IP Sockets in Java, 2nd Edition. I was hoping to get more clarity on something, but since the book's website doesn't have a forum or anything, I thought I'd ask here.
In several places, the book uses a byte mask to avoid sign extension. Here's an example:
private final static int BYTEMASK = 0xFF; // 8 bits

public static long decodeIntBigEndian(byte[] val, int offset, int size) {
    long rtn = 0;
    for (int i = 0; i < size; i++) {
        rtn = (rtn << Byte.SIZE) | ((long) val[offset + i] & BYTEMASK);
    }
    return rtn;
}
So here's my guess of what's going on. Let me know if I'm right.
BYTEMASK in binary should look like 00000000 00000000 00000000 11111111.
To make things easy, let's just say the val byte array only contains 1 short, so the offset is 0 and size is 2. Let's set the byte array to val[0] = 11111111, val[1] = 00001111. At i = 0, rtn is all 0's, so rtn << Byte.SIZE keeps the value the same. Then (long) val[0] becomes 8 bytes of all 1's due to sign extension, but & BYTEMASK sets all those extra 1's back to 0's, leaving only the least-significant byte as all 1's. Then rtn | that value turns on the 1's in rtn's least-significant byte. For i = 1, (rtn << Byte.SIZE) pushes that byte over one place and leaves all 0's in the least-significant byte. Then (long) val[1] makes a long of all 0's plus 00001111 in the least-significant byte, which is what we want, so & BYTEMASK doesn't change it. Then rtn | that value copies 00001111 into rtn's least-significant byte. The final return value is rtn = 00000000 00000000 00000000 00000000 00000000 00000000 11111111 00001111.
So, I hope this wasn't too long, and it was understandable. I just want to know if the way I'm thinking about this is correct, and not just completely wacked-out logic. Also, one thing that confuses me is that BYTEMASK is 0xFF. In binary, this would be 11111111, so if it's being implicitly cast to an int, wouldn't it actually be 11111111 11111111 11111111 11111111 due to sign extension? If that's the case, then it doesn't make sense to me how BYTEMASK would even work. Thank you for reading.
Everything is right except for the last point:
0xFF is already an int (0x000000FF), so it won't be sign-extended. In general, integer literals in Java are ints unless they end with an L or l, in which case they are longs.
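A quick way to see both effects side by side (a standalone snippet of mine, not from the book):

byte b = (byte) 0xFF;   // bit pattern 11111111
long unmasked = b;      // widening sign-extends: all 64 bits set, value -1
long masked = b & 0xFF; // the mask keeps only the low 8 bits, value 255
System.out.println(unmasked + " " + masked); // prints: -1 255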
I'm implementing variable-length encoding and reading the Wikipedia article about it. Here is what I found:
0x00000080 → 0x81 0x00
It means the int 0x80 is encoded as the two bytes 0x81 0x00. That's what I cannot understand. Okay, following the algorithm listed there, we have:
Binary 0x80: 00000000 00000000 00000000 10000000
We move the top bit to the next octet and set it to 1 (indicating that we have more octets):
00000000 00000000 00000001 10000000, which is not equal to 0x81 0x00. I tried to write a program for that:
byte[] ba = new byte[]{(byte) 0x81, (byte) 0x00};
int first = (ba[0] & 0xFF) & 0x7F;
int second = ((ba[1] & 0xFF) & 0x7F) << 7;
int result = first | second;
System.out.println(result); //prints 1, not 0x80
What did I miss?
Let's review the algorithm from the Wikipedia page:
Take the binary representation of the integer
Split it into groups of 7 bits; the group with the most significant bits may have fewer than 7
Take these seven bits as a byte, setting the MSB (most significant bit) to 1 for all but the last; leave it 0 for the last one
We can implement the algorithm like this:
public static byte[] variableLengthInteger(int input) {
    // first find out how many 7-bit groups we need to represent the integer
    int numBytes = ((32 - Integer.numberOfLeadingZeros(input)) + 6) / 7;
    // if the integer is 0, we still need 1 byte
    numBytes = numBytes > 0 ? numBytes : 1;
    byte[] output = new byte[numBytes];
    // fill the array from the end, so the most significant group ends up first
    for (int i = numBytes - 1; i >= 0; i--) {
        // take the least significant 7 bits of input and set the MSB to 1 ...
        output[i] = (byte) ((input & 0b1111111) | 0b10000000);
        // ... then shift the input right by 7 places, discarding the 7 bits we just used
        input >>= 7;
    }
    // finally clear the MSB on the last byte, marking the end of the integer
    output[numBytes - 1] &= 0b01111111;
    return output;
}
You can see it working for the examples from the Wikipedia page here; you can also plug in your own values and try it online.
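As for what the original decoding attempt missed: the most significant 7-bit group comes first in the byte stream, so ba[0] holds the high bits, not the low ones. A minimal decoding sketch under that assumption (the method name is mine):

public static int decodeVariableLengthInteger(byte[] input) {
    int result = 0;
    for (byte b : input) {
        // the most significant group arrives first, so shift what we have
        // so far and append the next 7-bit group; & 0x7F also strips the
        // continuation bit
        result = (result << 7) | (b & 0x7F);
    }
    return result;
}

Calling decodeVariableLengthInteger(new byte[]{(byte) 0x81, (byte) 0x00}) returns 0x80, as expected.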
Other variable-length encodings of integers exist and are widely used. For example, ASN.1 from 1984 defines the "length" field as:
The encoding of length can take two forms: short or long. The short
form is a single byte, between 0 and 127.
The long form is at least two bytes long, and has bit 8 of the first
byte set to 1. Bits 7-1 of the first byte indicate how many more bytes
are in the length field itself. Then the remaining bytes specify the
length itself, as a multi-byte integer.
This encoding is used, for example, in the DLMS/COSEM protocol and in HTTPS certificates. For sample code, you can have a look at an ASN.1 Java library.
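For illustration, here is a minimal sketch of the short and long definite-length forms quoted above (the method name and layout are mine, not from any particular ASN.1 library; it assumes a non-negative length):

public static byte[] encodeAsn1Length(int length) {
    if (length < 0x80) {
        // short form: a single byte holding 0..127
        return new byte[] { (byte) length };
    }
    // long form: work out how many bytes the length value itself needs
    int numBytes = (32 - Integer.numberOfLeadingZeros(length) + 7) / 8;
    byte[] out = new byte[numBytes + 1];
    out[0] = (byte) (0x80 | numBytes); // bit 8 set, bits 7-1 hold the byte count
    for (int i = numBytes; i >= 1; i--) {
        out[i] = (byte) length; // least significant byte goes last
        length >>>= 8;
    }
    return out;
}

For example, a length of 300 encodes as 0x82 0x01 0x2C.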
I have an int and a short like this:
int a = //...
short b = //..
What is the fastest way to craft int c with the following bit representation:
The 2nd and 3rd most significant bytes of a consist of the byte representation of b.
The rest of the bytes of a are left unchanged.
Maybe bitwise OR will help here, but I still don't see how.
For example:
a = 01010101 01010101 01010101 01010101
b = 11111111 11111111
Then we have
c = 01010101 11111111 11111111 01010101
Remove what used to be in those bytes, then put in b:
c = (a & 0xFF0000FF) | ((b << 8) & 0x00FFFF00);
The extra & after the shift is to counteract the sign-extension, which would otherwise overwrite the top byte with 1's whenever b is negative.
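A quick check with the example values from the question (hex literals instead of the binary strings above):

int a = 0x55555555;       // 01010101 01010101 01010101 01010101
short b = (short) 0xFFFF; // 11111111 11111111
int c = (a & 0xFF0000FF) | ((b << 8) & 0x00FFFF00);
System.out.println(Integer.toHexString(c)); // prints 55ffff55: the middle two bytes are replaced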
I am working on narrowing and checked the following code:
int i = 131072;
short s = (short)i;
System.out.println(s); //giving 0
This narrowing outputs 0. I am not able to get the logic behind it.
The int 131072 is 00000000 00000010 00000000 00000000 in binary.
When you cast it to short, only the lowest 16 bits remain - 00000000 00000000.
When you cast a primitive to a smaller primitive the top bits are dropped.
Written another way you can see what is happening.
int i = 0x20000;
short s = (short) (i & 0xFFFF);
Note: the lower 16 bits of your integer are all zero, so the answer is 0.
In binary, casting to (short) keeps only the lower 16 bits:
00000000 00000010 (00000000 00000000)
If you were to cast a longer number, it would still take the lower bits in each case. Note: the & in each case is redundant and only to help clarity.
long l = 0x0FEDCBA987654321L;
// i = 0x87654321
int i = (int) (l & 0xFFFFFFFFL);
// c = \u4321
char c = (char) (l & 0xFFFF);
// s = 0x4321
short s = (short) (l & 0xFFFF);
// b = 0x21
byte b = (byte) (l & 0xFF);
Primitive values (int, short, ...) are stored as binary values. An int uses more bits than a short. When you try to downcast, you're cutting away bits, which truncates (and potentially ruins) the value.
This is not a downcast (which refers to objects); it's a narrowing cast, or truncation. When you perform such a cast, you just copy the two least significant bytes of the int to your short. If the integer is smaller than 2^15, you'd just be ignoring bytes containing zeroes, so it would just work.
This is not the case here, however. If you examine the binary representation of 131072 you'll see it's 100000000000000000. So, the two least significant bytes are clearly 0, which is exactly what you're getting.
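For completeness, a tiny demonstration of that boundary (my own example, not from the original answers):

System.out.println((short) 32767);  // 2^15 - 1 fits in a short: prints 32767
System.out.println((short) 32768);  // 2^15 does not: wraps to -32768
System.out.println((short) 131072); // 2^17: the kept lower 16 bits are all zero, prints 0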
I have a byte variable:
byte varB = (byte) -1; // binary view: 1111 1111
I want to see the two left-most bits, so I do an unsigned right shift by 6 bits:
varB = (byte) (varB >>> 6);
But I'm getting -1, as if it were an int, and I only get 3 if I shift by 30!
How can I work around this and get the result with just a 6-bit shift?
The reason is the sign extension associated with the numeric promotion to int that occurs when bit-shifting. The value varB is promoted to int before shifting. The unsigned bit-shift to the right does occur, but its effects are dropped when casting back to byte, which only keeps the last 8 bits:
varB (byte) : 11111111
promoted to int : 11111111 11111111 11111111 11111111
shift right 6 : 00000011 11111111 11111111 11111111
cast to byte : 11111111
You can use the bitwise-and operator & to mask out the unwanted bits before shifting. Bit-anding with 0xFF keeps only the 8 least significant bits.
varB = (byte) ((varB & 0xFF) >>> 6);
Here's what happens now:
varB (byte) : 11111111
promoted to int : 11111111 11111111 11111111 11111111
bit-and mask : 00000000 00000000 00000000 11111111
shift right 6 : 00000000 00000000 00000000 00000011
cast to byte : 00000011
Because that's how shifting bytes is defined in the Java language: https://docs.oracle.com/javase/specs/jls/se8/html/jls-15.html#jls-15.19.
The gist is that types smaller than int are silently widened to int, shifted and then narrowed back.
Which makes your single line effectively equivalent to:
byte b = -1; // 1111_1111
int temp = b; // 1111_1111_1111_1111_1111_1111_1111_1111
temp >>>= 6; // 0000_0011_1111_1111_1111_1111_1111_1111
b = (byte) temp; // 1111_1111
To shift just the byte, you need to make the widening conversion explicit yourself with unsigned semantics (and the narrowing conversion needs to be done manually, too):
byte b = -1; // 1111_1111
int temp = b & 0xFF; // 0000_0000_0000_0000_0000_0000_1111_1111
temp >>>= 6; // 0000_0000_0000_0000_0000_0000_0000_0011
b = (byte) temp; // 0000_0011
One problem with the top answer is that, although it works correctly for the unsigned right shift >>>, it doesn't work for the signed right shift >>. This is because >> depends on the sign bit (the leftmost bit), which moves when the value is promoted to int. This means that when you use >>, you'll get 00000011 when you might expect 11111111. If you want a trick that works for both, try shifting left by 24, doing your chosen right shift, then shifting back to the right by 24. That way your byte's sign bit is in the right place.
varB = (byte) (varB << 24 >> 6 >> 24);
I've [bracketed] the sign bit. Here's what's happening:
varB (byte) : [1]1111111
promoted to int : [1]1111111 11111111 11111111 11111111
shift left 24 : [1]1111111 00000000 00000000 00000000
signed shift right 6 : [1]1111111 11111100 00000000 00000000
shift right 24 : [1]1111111 11111111 11111111 11111111
cast to byte : [1]1111111
Here you can see it also works for >>>:
varB = (byte) (varB << 24 >>> 6 >> 24);
varB (byte) : [1]1111111
promoted to int : [1]1111111 11111111 11111111 11111111
shift left 24 : [1]1111111 00000000 00000000 00000000
unsigned shift right 6 : [0]0000011 11111100 00000000 00000000
shift right 24 : [0]0000000 00000000 00000000 00000011
cast to byte : [0]0000011
This costs more operations for the convenience of not having to remember the rules about which one you should and shouldn't bitmask. So use whatever solution works for you.
Btw, it's good to know that short is also promoted to int, which means everything in these answers applies to it as well. The only difference is that you shift left/right by 16, and the bitmask is 0xFFFF.
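For instance, a minimal sketch of the same masking workaround applied to a short (my own example):

short s = (short) -1;             // 11111111 11111111
s = (short) ((s & 0xFFFF) >>> 6); // 00000011 11111111, i.e. 1023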
I have the following C code (from FFmpeg):
static inline av_const int sign_extend(int val, unsigned bits)
{
    unsigned shift = 8 * sizeof(int) - bits;
    union { unsigned u; int s; } v = { (unsigned) val << shift };
    return v.s >> shift;
}
I'm trying to reproduce this in Java, but I have difficulties understanding it. No matter how I toss the bits around, I don't get very close.
As for the val parameter: it takes an unsigned byte value as an int.
Bits parameter: 4
If the value is 255 and bits is 4, it returns -1. I can't reproduce this in Java. Sorry for such a fuzzy question, but can you help me understand this code?
The big picture is that I'm trying to encode EA ADPCM audio in Java. In FFmpeg:
https://gitorious.org/ffmpeg/ffmpeg/source/c60caa5769b89ab7dc4aa41a21f87d5ee177bd30:libavcodec/adpcm.c#L981
Strictly speaking, the result of running this code with this input data is not well-defined, because signed right shift in C is only properly defined in circumstances that this scenario does not meet. From the C99 standard:
The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has unsigned type or if E1 has signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
(Emphasis mine)
But let's assume that your implementation defines signed right shift to extend the sign, meaning that the space on the left will be filled with ones if the sign bit is set and zeroes otherwise; the ffmpeg code clearly expects this to be the case. The following is happening: shift has the value of 28 (assuming 32-bit integers). In binary notation:
00000000 00000000 00000000 11111111 = val
11110000 00000000 00000000 00000000 = (unsigned) val << shift
Note that when interpreting (unsigned) val << shift as a signed integer, as the code proceeds to do (assuming two's complement representation, as today's computers all use¹), the sign bit of that integer is set, so a signed shift to the right fills up with ones from the left, and we get
11110000 00000000 00000000 00000000 = v.s
11111111 11111111 11111111 11111111 = v.s >> shift
...and in two's complement representation, that is -1.
In Java, this trick works the same way, except better, because there the behavior is actually guaranteed. Simply:
public static int sign_extend(int val, int bits) {
    int shift = 32 - bits; // int always has 32 bits in Java
    int s = val << shift;
    return s >> shift;
}
Or, if you prefer:
public static int sign_extend(int val, int bits) {
    int shift = 32 - bits;
    return val << shift >> shift;
}
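A quick sanity check with the values from the question (the second call is an extra example of mine):

System.out.println(sign_extend(255, 4)); // low 4 bits are 1111, prints -1
System.out.println(sign_extend(5, 4));   // low 4 bits are 0101, prints 5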
¹ Strictly speaking, this conversion does not have a well-defined value in the C standard either, for historical reasons. There used to be computers that used different representations, and the same bit pattern with a set sign bit has a completely different meaning in (for example) signed-magnitude representation.
The reason the code looks so odd is that the C language is full of undefined behaviours that are well-defined in Java. For example, in C, bit-shifting a signed integer left so that the sign bit changes is undefined behaviour, and at that point the program can do anything - crash, print 42, make true equal false - and the compiler still compiled it correctly.
Now the code uses a trick to shift the integer left: it uses a union that lays the bytes of its members on top of each other, making an unsigned and a signed integer occupy the same bytes. The left shift is defined for the unsigned integer, so we do the unsigned shift using it; then we shift back using the signed shift. (The code assumes that the right shift of a negative number produces a properly sign-extended negative number, which is also not guaranteed by the standard, but usually these kinds of libraries have a configuration utility that can refuse compilation on such a quite esoteric platform. Likewise, this program assumes that CHAR_BIT is 8; however, C only guarantees that a char is at least 8 bits wide.)
In Java, you do not need anything like a union to accomplish this; instead you do:
static int signExtend(int val, int bits) {
    int shift = 32 - bits; // fixed size
    int v = val << shift;
    return v >> shift;
}
In Java the width of an int is always 32 bits; << can be used for both signed and unsigned shift; and there is no undefined behaviour when shifting into the sign bit. >> is the signed shift (>>> is the unsigned one).
given this code:
static inline av_const int sign_extend(int val, unsigned bits)
{
    unsigned shift = 8 * sizeof(int) - bits;
    union { unsigned u; int s; } v = { (unsigned) val << shift };
    return v.s >> shift;
}
The 'static' modifier says the function is not visible outside the current file.
The 'inline' modifier is a 'request' to the compiler to place the code inline wherever the function is called, rather than having a separate function with the associated call/return code sequences.
'sign_extend' is the name of the function.
In C, a right shift of a signed value will propagate the sign bit, while a right shift of an unsigned value will zero-fill. It looks like your Java is doing the zero fill.
Regarding this line:
unsigned shift = 8 * sizeof(int) - bits;
On a 32-bit machine, an int is 32 bits, so sizeof(int) is 4, and the variable 'shift' will contain (8*4) - bits, i.e. 32 - bits.
Regarding this line:
union { unsigned u; int s; } v = { (unsigned) val << shift };
A left shift of an unsigned value shifts the bits left, with the upper bits being dropped into the bit bucket and the lower bits being zero-filled.
Regarding this line:
return v.s >> shift;
This shifts the bits back to their original position, while propagating the (new) sign bit.