How to avoid sign extending bit mask in Java?

My bit masks are bytes, and I'd like to keep them exactly as they are, but I think they're being sign extended. I don't care whether the byte is considered positive or negative, as long as it has the same bits set. After a few hours of debugging, I found that I only have a problem with my byte bit masks when they happen to be negative. I can't be the only one who's had a problem with this. Is there a way to make a byte behave as if it were unsigned?

If you don't want a byte to sign extend when you use it in arithmetic (or bitwise) operators, you need to explicitly bitwise-and it with 0xFF. It looks slightly ugly but is unavoidable if what you have is a byte (and hopefully a decent JIT will be able to recognize the idiom and make efficient code out of it anyway).
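As a minimal sketch of the masking idiom described above (the class name ByteMaskDemo and its method names are made up for illustration), the following compares a byte mask that is widened as-is with one that is masked with 0xFF first:

```java
// Illustrative sketch only: ByteMaskDemo and its methods are hypothetical names.
class ByteMaskDemo {
    // Widening a byte to int copies the sign bit into the upper 24 bits.
    static int signExtended(byte mask) {
        return mask; // (byte) 0x80 becomes 0xFFFFFF80
    }

    // Masking with 0xFF keeps only the low 8 bits, so the bit pattern is preserved.
    static int zeroExtended(byte mask) {
        return mask & 0xFF; // (byte) 0x80 becomes 0x00000080
    }

    public static void main(String[] args) {
        byte mask = (byte) 0x80;
        int flags = 0x0100; // a bit that lies outside the byte's 8 bits

        // Without the 0xFF mask, the sign-extended byte also matches bit 8.
        System.out.println(Integer.toHexString(flags & signExtended(mask))); // 100
        System.out.println(Integer.toHexString(flags & zeroExtended(mask))); // 0
    }
}
```

The only difference between the two methods is the & 0xFF, which is why the bug only shows up for masks whose top bit is set.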

Do you have a right shift in your code? Do you use '>>' instead of '>>>'? That may be your problem.
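To illustrate the difference between the two shift operators, here is a small sketch; ShiftDemo and its methods are hypothetical names. Note that a byte operand is promoted to int (with sign extension) before either shift is applied, so for byte values the & 0xFF mask is still needed even with '>>>'.

```java
// Illustrative sketch only: ShiftDemo and its methods are hypothetical names.
class ShiftDemo {
    // Arithmetic shift: copies of the sign bit are shifted in from the left.
    static int arithmeticShift(int value, int n) {
        return value >> n;
    }

    // Logical shift: zeros are shifted in from the left.
    static int logicalShift(int value, int n) {
        return value >>> n;
    }

    // For a byte, mask first; otherwise the promoted int already carries sign bits.
    static int logicalShiftByte(byte value, int n) {
        return (value & 0xFF) >>> n;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(arithmeticShift(0x80000000, 4))); // f8000000
        System.out.println(Integer.toHexString(logicalShift(0x80000000, 4)));    // 8000000
        byte b = (byte) 0x80;
        System.out.println(Integer.toHexString(b >>> 4));                        // ffffff8
        System.out.println(Integer.toHexString(logicalShiftByte(b, 4)));         // 8
    }
}
```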

Related

Java: Thinking in Java 4th edition. Why would one use a hexadecimal here?

"Start with a number that has a binary one in the most significant position (hint: Use a hexadecimal constant). Using the signed right-shift operator, right shift it all the way through all of its binary positions, each time displaying the result using Integer.toBinaryString( )."
"hint: use a hexadecimal constant": why is that? Isn't it possible, and simply easier, to just declare the constant like this: int i = Integer.parseInt("10101010", 2); ?

Well, you don't have to use a hexadecimal constant, as you figured out. It's just nice because it's short, and you don't need to call any methods to use it. In addition, you get some compile-time checks that would guarantee that what you have is actually an int.
So in your case, you want a number with a binary one in the most significant position. For an int, that's this in binary:
1000 0000 0000 0000 0000 0000 0000 0000
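So let's explore some ways of writing this in code. You can do roughly what you suggested. Note that plain Integer.parseInt throws a NumberFormatException for this particular value, because 2^31 is larger than Integer.MAX_VALUE, so from Java 8 on you would use Integer.parseUnsignedInt:
int i = Integer.parseUnsignedInt("10000000000000000000000000000000", 2);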
But there are some issues with this:
Not very readable
You don't get compile-time checks to make sure what you really have is an int
So if you did this by accident:
int i = Integer.parseInt("100000000000000000000000000000000", 2); // See the mistake?
You get this at runtime:
Exception in thread "main" java.lang.NumberFormatException: For input string: "100000000000000000000000000000000"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:583)
at Test.main(Test.java:3)
And that's no fun. Accidentally providing a number that is out of range (e.g. by adding another 0 by accident) results in a runtime exception. Doesn't make debugging too fun, does it...
Compile-time constants are far preferable. If you have Java 7 or above, you can do this:
int i = 0b1000_0000_0000_0000_0000_0000_0000_0000;
While still awkward, it's significantly more readable, and you at least get a compile-time error if you add one zero too many.
Hex is even better:
int i = 0x8000_0000;
So while it isn't required, it's probably recommended. You can expect that any programmers reading your code would understand hex, and it's short, readable, and you have the compiler catching some mistakes.
So to wrap up: Yes, it's possible to use your suggested syntax, but I wouldn't say it's easier. Hex is short, more readable, and you have less of a chance of leaving a mistake in your code. If you want to be really explicit, I'd still recommend using the binary literal over the parseInt() method, as it is safer.
(By the way, the example in your question isn't what you were asked for, but I'm assuming that was an example of syntax and not what you actually had)
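For reference, the exercise itself could be sketched roughly as follows, starting from the hex constant 0x8000_0000 and applying the signed right shift once per bit position. The class name SignedShiftExercise and the helper shiftThrough are assumptions for illustration, not taken from the book.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: SignedShiftExercise and shiftThrough are hypothetical names.
class SignedShiftExercise {
    // Collects Integer.toBinaryString(value) for each of the 32 positions,
    // applying the signed right shift (>>) once between steps.
    static List<String> shiftThrough(int start) {
        List<String> lines = new ArrayList<>();
        int value = start;
        for (int i = 0; i < Integer.SIZE; i++) {
            lines.add(Integer.toBinaryString(value));
            value >>= 1; // the sign bit is copied in from the left
        }
        return lines;
    }

    public static void main(String[] args) {
        // The hex literal is checked by the compiler, unlike a parsed string.
        for (String line : shiftThrough(0x8000_0000)) {
            System.out.println(line);
        }
    }
}
```

Because the top bit is set, each step fills in another leading one, and the last line printed consists of 32 ones.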

Java/Python MD5 implementation- how to overcome unsigned 32-bit requirement?

I'm attempting to implement MD5 (for curiosity's sake) in Python/Java, and am effectively translating the Wikipedia MD5 page's pseudocode into either language. First, I used Java, only to encounter frustration with its negative/positive integer overflow (because unsigned ints aren't an option: for every int, -2147483648 <= int <= 2147483647). I then employed Python, after deciding that it's better suited for heavy numerical computation, but realized that I wouldn't be able to overcome the unsigned 32-bit integer requirement, either (as Python immediately casts wrapped ints to longs).
Is there any way to hack around Java/Python's lack of unsigned 32-bit integers, which are required by the aforementioned MD5 pseudocode?

Since the operations are bitwise operations plus addition modulo 2^32 (which Java's int addition already performs), they wouldn't suffer from sign extension (which would cause you problems), except for right shift.
Java has a >>> operator for this purpose.
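A rough sketch of how the building blocks from the MD5 pseudocode might look with plain Java ints is shown below. The class Md5Primitives and its method names are hypothetical, and this is only the rotate/add/print part, not a full MD5 implementation.

```java
// Illustrative sketch only: Md5Primitives and its methods are hypothetical names.
class Md5Primitives {
    // Left rotate: >>> shifts in zeros, so no sign bits leak into the result.
    // (Integer.rotateLeft from the standard library does the same thing.)
    static int leftRotate(int x, int c) {
        return (x << c) | (x >>> (32 - c));
    }

    // int addition silently wraps, which is exactly addition modulo 2^32.
    static int addMod32(int a, int b) {
        return a + b;
    }

    // Signedness only matters when printing; format the bits as unsigned hex.
    static String toHex(int x) {
        return String.format("%08x", x);
    }

    public static void main(String[] args) {
        System.out.println(toHex(leftRotate(0x80000001, 1)));     // 00000003
        System.out.println(toHex(addMod32(0xFFFFFFFF, 2)));       // 00000001
        System.out.println(Integer.toUnsignedString(0xFFFFFFFF)); // 4294967295
    }
}
```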

As a note beforehand - I don't know if this is a good solution, but it appears to give the behaviour you want.
Using the ctypes module, you can access the underlying low-level data-type directly, and hence have an unsigned int in Python.
Specifically, ctypes.c_uint:
>>> import ctypes
>>> i = ctypes.c_uint(0)
>>> i.value -= 1
>>> i
c_uint(4294967295)
>>> i.value += 1
>>> i
c_uint(0)
This is arguably abuse of the module - it's designed for using C code easily from within Python, but as I say, it appears to work. The only real downside I can think of is that I assume ctypes is CPython specific.

Somehow working with unsigned bytes in Java

I'm trying to create a Java program that writes files for my Arduino to read. The Arduino is a simple 8 bit microcontroller board, and with some extra hardware, can read text files from SD cards, byte by byte.
Turns out this was a whole lot harder than I thought. Firstly, there are no unsigned values in Java. Not even bytes for some reason! Even trying to set a byte to 0xFF gives a possible loss of precision error! This isn't very useful for this low-level code.
I would use ints and only use the positive values, but I like using byte overflow to my advantage in a lot of my code (though I could probably do this with modulus right after the math operation or something) and the biggest problem of all is I have no idea how to add an int as an 8 bit character to a String that gets written to a file later. Output is currently my biggest problem.
So, what would be the best way to do unsigned bit math based on some user input and then write those bits to a file as if each one was an ASCII character?

So, here's how it works.
You can treat Java bytes as unsigned. The only places where signs make a difference are
constants: just cast them to bytes
toString and parseInt
division
<, >, >=, <=
Operations where signedness does not matter:
addition
subtraction
multiplication
bit arithmetic (except for >>; use >>> instead, masking a byte with & 0xFF first, because it is sign-extended to int before the shift)
To convert bytes to their unsigned values as ints, just use & 0xFF, and to convert those to bytes use (byte).
Alternatively, if third-party libraries are acceptable, you might be interested in Guava's UnsignedBytes utility class. (Disclosure: I contribute to Guava.)
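The & 0xFF and (byte) conversions described above, plus writing the resulting bytes out, might be sketched like this. UnsignedByteDemo and its methods are hypothetical names; a ByteArrayOutputStream is used here only to keep the example self-contained, and a FileOutputStream offers the same write(int) method for real files.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only: UnsignedByteDemo and its methods are hypothetical names.
class UnsignedByteDemo {
    // byte -> unsigned value in the range 0..255
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Addition wraps modulo 256 once the result is cast back to byte.
    static byte wrapAdd(byte a, int delta) {
        return (byte) (a + delta);
    }

    // Comparisons are sign-sensitive, so compare the unsigned int values.
    static boolean lessThanUnsigned(byte a, byte b) {
        return (a & 0xFF) < (b & 0xFF);
    }

    // OutputStream.write(int) stores only the low 8 bits of each value.
    static byte[] encode(int... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            out.write(v);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFF;                       // constants need the cast
        System.out.println(toUnsigned(b));          // 255
        System.out.println(toUnsigned(wrapAdd(b, 1))); // 0
        System.out.println(encode(200, 255).length);   // 2
    }
}
```

Writing raw bytes through an OutputStream, rather than appending characters to a String, avoids any character-encoding step between the computed values and the file.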

Why would you need unsigned types in Java?

I have often heard complaints against Java for not having unsigned data types. See for example this comment. I would like to know how this is a problem. I have been programming in Java for 10 years more or less and never had issues with it. Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Since unsigned and signed numbers are represented with the same bit values, the only places I can think of where signedness matters are:
When converting the numbers to other bit representations. Between 8, 16 and 32 bit integer types you can use bitmasks if needed.
When converting numbers to decimal format, usually to Strings.
Interoperating with non-Java systems through APIs or protocols. Again the data is just bits, so I don't see the problem here.
Using the numbers as memory or other offsets. With 32 bit ints this might be a problem for very large offsets.
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those. What am I missing? What are the actual benefits of having unsigned types in a programming language and how would having those make Java better?

Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Why not? Is "applying a bitwise AND with 0xFF" actually part of what your code is trying to represent? If not, why should it have to be part of how you write it? I actually find that almost anything I want to do with bytes beyond just copying them from one place to another ends up requiring a mask. I want my code to be cruft-free; the lack of unsigned bytes hampers this :(
Additionally, consider an API which will always return a non-negative value, or only accepts non-negative values. Using an unsigned type allows you to express that clearly, without any need for validation. Personally I think it's a shame that unsigned types aren't used more in .NET, e.g. for things like String.Length, ICollection.Count etc. It's very common for a value to naturally only be non-negative.
Is the lack of unsigned types in Java a fatal flaw? Clearly not. Is it an annoyance? Absolutely.
The comment that you quote hits the nail on the head:
Java's lack of unsigned data types also stands against it. Yes, you can work around it, but it's not ideal and you'll be using code that doesn't really reflect the underlying data correctly.
Suppose you are interoperating with another system, which wants an unsigned 16 bit integer, and you want to represent the number 65535. You claim "the data is just bits, so I don't see the problem here" - but having to pass -1 to mean 65535 is a problem. Any impedance mismatch between the representation of your data and its underlying meaning introduces an extra speedbump when writing, reading and testing the code.
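The mismatch described here can be made concrete with a small sketch. UnsignedShortDemo, toWire and fromWire are hypothetical names; Short.toUnsignedInt is a standard method available since Java 8.

```java
// Illustrative sketch only: UnsignedShortDemo and its methods are hypothetical names.
class UnsignedShortDemo {
    // Packs a value in 0..65535 into the 16 bits of a short.
    static short toWire(int value) {
        if (value < 0 || value > 0xFFFF) {
            throw new IllegalArgumentException("not an unsigned 16-bit value: " + value);
        }
        return (short) value; // 65535 is stored as -1
    }

    // Recovers the unsigned meaning of the 16 bits.
    static int fromWire(short raw) {
        return raw & 0xFFFF; // same result as Short.toUnsignedInt(raw)
    }

    public static void main(String[] args) {
        short wire = toWire(65535);
        System.out.println(wire);           // -1
        System.out.println(fromWire(wire)); // 65535
    }
}
```

The range check and the two conversions are exactly the kind of extra code an unsigned 16-bit type would make unnecessary.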
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those.
The only times you would need to consider those operations is when you were naturally working with values of two different types - one signed and one unsigned. At that point, you absolutely want to have that difference pointed out. With signed types being used to represent naturally unsigned values, you should still be considering the differences, but the fact that you should is hidden from you. Consider:
// This should be considered unsigned - so a value of -1 is "really" 65535
short length = /* some value */;
// This is really signed
short foo = /* some value */;
boolean result = foo < length;
Suppose foo is 100 and length is -1. What's the logical result? The value of length represents 65535, so logically foo is smaller than it. But you'd probably go along with the code above and get the wrong result.
Of course they don't even need to represent different types here. They could both be naturally unsigned values, represented as signed values with negative numbers being logically greater than positive ones. The same error applies, and wouldn't be a problem if you had unsigned types in the language.
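Filling in the values from the example (foo is 100, length is -1), a sketch of the wrong and the intended comparison might look like this. UnsignedCompareDemo and its method names are hypothetical; Short.toUnsignedInt is a standard Java 8 method.

```java
// Illustrative sketch only: UnsignedCompareDemo and its methods are hypothetical names.
class UnsignedCompareDemo {
    // What the plain code does: both operands are treated as signed.
    static boolean naiveLess(short foo, short length) {
        return foo < length;
    }

    // foo is signed, length is logically unsigned: widen length without sign extension.
    static boolean signedVsUnsignedLess(short foo, short length) {
        return foo < Short.toUnsignedInt(length);
    }

    // Both values are logically unsigned: widen both before comparing.
    static boolean bothUnsignedLess(short a, short b) {
        return Short.toUnsignedInt(a) < Short.toUnsignedInt(b);
    }

    public static void main(String[] args) {
        short foo = 100;
        short length = -1; // "really" 65535
        System.out.println(naiveLess(foo, length));            // false
        System.out.println(signedVsUnsignedLess(foo, length)); // true
    }
}
```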
You might also want to read this interview with Joshua Bloch (Google cache, as I believe it's gone from java.sun.com now), including:
Ooh, good question... I'm going to say that the strangest thing about the Java platform is that the byte type is signed. I've never heard an explanation for this. It's quite counterintuitive and causes all sorts of errors.

If you like, yes, everything is ones and zeroes. However, your hardware arithmetic and logic unit doesn't work that way. If you want to store your bits in a signed integer value but perform operations that are not natural to signed integers, you will usually waste both storage space and processing time.
An unsigned integer type stores twice as many non-negative values in the same space as the corresponding signed integer type. So if you want to take into Java any data commonly used in a language with unsigned values, such as a POSIX date value (unsigned number of seconds) that is normally used with C, then in general you will need to use a wider integer type than C would use. If you are processing many such values, again you will waste both storage space and fetch-execute time.
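The "wider integer type" point can be sketched as follows: an unsigned 32-bit value read from an external source fits into an int bit-for-bit, but its numeric value only fits into a long. UnsignedIntWidening and widen are hypothetical names; Integer.toUnsignedLong is a standard Java 8 method that does the same thing.

```java
// Illustrative sketch only: UnsignedIntWidening and widen are hypothetical names.
class UnsignedIntWidening {
    // Interprets the 32 bits of raw as an unsigned value, which needs a 64-bit long.
    static long widen(int raw) {
        return raw & 0xFFFFFFFFL; // the long mask prevents sign extension
    }

    public static void main(String[] args) {
        int raw = 0xFFFFFFFF;           // all 32 bits set
        System.out.println(raw);        // -1
        System.out.println(widen(raw)); // 4294967295
    }
}
```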

The times I have used unsigned data types have been when I read in large blocks of data that correspond to images, or worked with OpenGL. I personally prefer unsigned if I know something will never be negative, as a "safety feature" of sorts.
Unsigned types are useful for bit-by-bit comparisons, and I'm pretty sure they are used extensively in graphics.

Why are there no byte or short literals in Java?

I can create a literal long by appending an L to the value; why can't I create a literal short or byte in some similar way? Why do I need to use an int literal with a cast?
And if the answer is "Because there was no short literal in C", then why are there no short literals in C?
This doesn't actually affect my life in any meaningful way; it's easy enough to write (short) 0 instead of 0S or something. But the inconsistency makes me curious; it's one of those things that bother you when you're up late at night. Someone at some point made a design decision to make it possible to enter literals for some of the primitive types, but not for all of them. Why?

In C, int at least was meant to have the "natural" word size of the CPU and long was probably meant to be the "larger natural" word size (not sure about that last part, but it would also explain why int and long have the same size on x86).
Now, my guess is: for int and long, there's a natural representation that fits exactly into the machine's registers. On most CPUs however, the smaller types byte and short would have to be padded to an int anyway before being used. If that's the case, you can as well have a cast.

I suspect it's a case of "don't add anything to the language unless it really adds value" - and it was seen as adding sufficiently little value to not be worth it. As you've said, it's easy to get round, and frankly it's rarely necessary anyway (only for disambiguation).
The same is true in C#, and I've never particularly missed it in either language. What I do miss in Java is an unsigned byte type :)

Another reason might be that the JVM doesn't know about short and byte. All calculations and storage are done with ints, longs, floats and doubles inside the JVM.

There are several things to consider.
1) As discussed above the JVM has no notion of byte or short types. Generally these types are not used in computation at the JVM level; so one can think there would be less use of these literals.
2) For initialization of byte and short variables, if the int expression is constant and in the allowed range of the type it is implicitly cast to the target type.
3) One can always cast the literal, e.g. (short) 10
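Points 2 and 3 can be sketched as follows. NarrowingDemo and its members are hypothetical names; the overloaded describe methods are there only to show the disambiguation case where a cast is actually needed.

```java
// Illustrative sketch only: NarrowingDemo and its members are hypothetical names.
class NarrowingDemo {
    // Constant int expressions in range are narrowed implicitly on assignment.
    static final byte B = 10;
    static final short S = 1000 + 24;

    // 255 does not fit in a byte (-128..127), so here the cast is required.
    static final byte ALL_ONES = (byte) 0xFF;

    static String describe(int x) {
        return "int";
    }

    static String describe(short x) {
        return "short";
    }

    public static void main(String[] args) {
        // Method arguments are not narrowed implicitly, so a bare literal is an int.
        System.out.println(describe(10));         // int
        System.out.println(describe((short) 10)); // short
        System.out.println(ALL_ONES);             // -1
    }
}
```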
