I'm reading a file format that specifies some types are unsigned integers and shorts. When I read the values, I get them as a byte array. The best route to turning them into shorts/ints/longs I've seen is something like this:
ByteBuffer wrapped = ByteBuffer.wrap(byteArray);
int x = wrapped.getInt();
That looks like it could easily overflow for unsigned ints. Is there a better way to handle this scenario?
Update: I should mention that I'm using Groovy, so I absolutely don't care if I have to use a BigInteger or something like that. I just want the maximum safety on keeping the value intact.
A 32-bit value, signed or unsigned, can always be stored losslessly in an int*. This means that you never have to worry about putting unsigned values in signed types from a data safety point of view.
The same is true for 8-bit values in bytes, 16-bit values in shorts and 64-bit values in longs.
Once you've read an unsigned value into the corresponding signed type, you can promote it to a signed value of a larger type to more easily work with the intended value (see the sketch after this list):
Integer.toUnsignedLong(int)
Short.toUnsignedInt(short)
Byte.toUnsignedInt(byte)
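For example, a minimal sketch (assuming byteArray came straight from the file and that big-endian order, ByteBuffer's default, is what the format uses; the variable names are just illustrative):
ByteBuffer buf = ByteBuffer.wrap(byteArray);              // big-endian by default
int unsignedShort = Short.toUnsignedInt(buf.getShort());  // 0..65535
long unsignedInt  = Integer.toUnsignedLong(buf.getInt()); // 0..4294967295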
Since there's no primitive type larger than long, you can either go via BigInteger, or use the convenience methods on Long to do unsigned operations:
new BigInteger(Long.toUnsignedString(long))
Long.divideUnsigned(long,long) and friends
* This is thanks to the JVM requiring integer types to be two's complement.
To hold an unsigned int/short/byte, you need to use the next "bigger" type, i.e. long/int/short. If you already hold the value in the corresponding signed type (where it may look negative), the conversion can be done like this:
int unsignedVal = byteVal & 0xff;
If you just cast, the sign bit is extended and you still end up with a negative value.
If you have to handle unsigned longs, you need to "switch" to java.math.BigInteger.
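As a minimal sketch, the equivalent masks for each width (assuming the values are already sitting in the signed variables byteVal, shortVal and intVal):
int unsignedByte   = byteVal  & 0xFF;         // 0..255
int unsignedShort  = shortVal & 0xFFFF;       // 0..65535
long unsignedInt   = intVal   & 0xFFFFFFFFL;  // 0..4294967295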
Unsigned primitives are a pain in Java.
There's no clean way of handling them, except using larger types with more bits, and taking care to avoid automatic sign extension when casting.
In your case, you can do something like this:
ByteBuffer wrapped = ByteBuffer.wrap(byteArray);
int signedInt = wrapped.getInt();
long unsigned = signedInt & 0xffffffffL;
I usually write the required conversion(s) in a utility class someplace (something like the sketch below), since they're easy to get wrong. If you copy & paste that one-liner conversion everywhere, eventually one will be wrong.
Note that if you need unsigned longs, the only larger type is BigInteger.
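A rough sketch of such a utility class (the class and method names are hypothetical, not from any library):
public final class UnsignedConversions {
    private UnsignedConversions() {}

    public static int fromUnsignedByte(byte b)   { return b & 0xFF; }
    public static int fromUnsignedShort(short s) { return s & 0xFFFF; }
    public static long fromUnsignedInt(int i)    { return i & 0xFFFFFFFFL; }

    // No primitive is wider than long, so unsigned 64-bit values go via BigInteger.
    public static java.math.BigInteger fromUnsignedLong(long l) {
        return new java.math.BigInteger(Long.toUnsignedString(l));
    }
}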
If you need anything more than simple conversions, I suggest using Guava, since it has some nice classes for dealing with unsigned types (for example UnsignedInts and UnsignedLongs). See its documentation for details.
Related
Currently, I am using signed values, -2^63 to 2^63-1. Now I need the same number of values (2^64), but non-negative only, i.e. 0 to 2^64-1. I found the Java documentation mentioning unsigned long, which suits this use.
I tried assigning 2^64 to a Long wrapper object, but it loses the data; in other words, it only holds values up to Long.MAX_VALUE, so I am clearly missing something.
Is BigInteger the unsigned long that Java supports?
Is there a definition or pointer as to how to declare and use it?
In Java 8, unsigned long support was introduced. These are still ordinary longs, but the sign doesn't affect adding and subtracting. For dividing and comparing, you have dedicated methods in Long. Also, you can do the following:
long l1 = Long.parseUnsignedLong("12345678901234567890");
String l1Str = Long.toUnsignedString(l1);
BigInteger is a bit different: it can hold arbitrarily large numbers, stores them internally as an int[], and supports full arithmetic.
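A small sketch of those dedicated methods in use (the values are arbitrary):
long a = Long.parseUnsignedLong("18446744073709551615"); // 2^64 - 1, stored as the bit pattern of -1
long b = 10L;
System.out.println(Long.toUnsignedString(Long.divideUnsigned(a, b))); // 1844674407370955161
System.out.println(Long.compareUnsigned(a, b) > 0);                   // true: a is larger when read as unsigned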
Although Java has no unsigned long type, you can treat signed 64-bit two's-complement integers (i.e. long values) as unsigned if you are careful about it.
Many primitive integer operations are sign agnostic for two's-complement representations. For example, you can use Java primitive addition, subtraction and multiplication on an unsigned number represented as a long, and get the "right" answer.
For other operations such as division and comparison, the Long class provides methods like divideUnsigned and compareUnsigned that will give the correct results for unsigned numbers represented as long values.
The Long methods supporting unsigned operations were added in Java 8. Prior to that, you could use 3rd-party libraries to achieve the same effect. For example, the static methods in the Guava UnsignedLongs class.
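For instance, a minimal sketch showing that plain + and * wrap modulo 2^64 and therefore give the right unsigned answers, even though the bit patterns look negative as signed longs:
long max = Long.parseUnsignedLong("18446744073709551615"); // 2^64 - 1, same bits as -1L
System.out.println(Long.toUnsignedString(max + 1));        // prints 0 (wraps around)
System.out.println(Long.toUnsignedString(max * 2));        // prints 18446744073709551614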
Is BigInteger the unsigned long that Java supports?
BigInteger would be another way to represent integer values greater than Long.MAX_VALUE. But BigInteger is a heavy-weight class. It is unnecessary if your numbers all fall within the range 0 to 2^64 - 1 (inclusive).
If using a third party library is an option, there is jOOU (a spin-off library from jOOQ), which offers wrapper types for unsigned integer numbers in Java. That's not exactly the same thing as having primitive type (and thus byte code) support for unsigned types, but perhaps it's still good enough for your use-case.
import static org.joou.Unsigned.*;
// and then...
UByte b = ubyte(1);
UShort s = ushort(1);
UInteger i = uint(1);
ULong l = ulong(1);
All of these types extend java.lang.Number and can be converted into higher-order primitive types and BigInteger. In your case, earlier versions of jOOU simply stored the unsigned long value in a BigInteger. Version 0.9.3 does some cool bit shifting to fit the value in an ordinary long.
(Disclaimer: I work for the company behind these libraries)
Maybe someone can help me understand because I feel I'm missing something that will likely have an effect on how my program runs.
I'm using a ByteArrayOutputStream. Unless I've missed something huge, the point of this class is to create a byte[] array for some other use.
However, the "plain" write function on BAOS takes an int, not a byte (ByteArrayOutputStream.write).
According to this (Primitive Data Types) page, in Java, an int is a 32-bit data type and a byte is an 8-bit data type.
If I write this code...
int i = 32;
byte b = i;
I get a compile error about a possible lossy conversion, requiring a change to this...
int i = 32;
byte b = (byte)i;
I'm really confused about write(int)...
ByteArrayOutputStream is just overriding the abstract method declared in OutputStream. So the real question is why OutputStream.write(int) is declared that way, when its stated goal is to write a single byte to the stream. The implementation of the stream is irrelevant here.
Your intuition is correct - it's a broken bit of design, in my view. And yes, it will lose data, as is explicitly called out in the docs:
The byte to be written is the eight low-order bits of the argument b. The 24 high-order bits of b are ignored.
It would have been much more sensible (in my view) for this to be write(byte). The only downside is that you couldn't then call it with literal values without casting:
// Write a single byte 0. Works with current code, wouldn't work if the parameter
// were byte.
stream.write(0);
That looks okay, but isn't - because the type of the literal 0 is int, which isn't implicitly convertible to byte. You'd have to use:
// Ugly, but would have been okay with write(byte).
stream.write((byte) 0);
For me that's not a good enough reason to design the API the way it is, but that's what we've got - and have had since Java 1.0. It can't be fixed now without it being a breaking change all over the place, unfortunately.
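A quick sketch of that truncation in practice (the value 300 is arbitrary):
ByteArrayOutputStream stream = new ByteArrayOutputStream();
stream.write(300);                            // only the low 8 bits survive: 300 & 0xFF == 44
System.out.println(stream.toByteArray()[0]);  // prints 44, not 300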
This is done partly to accommodate unsigned byte values above 0x7F: the int is silently narrowed when written. In fact, the implementation does exactly that, with a (byte) cast.
As Ingo states:
A possible reason could be that the byte to write will most often be the result of some operation that automatically converts its operands to int[, like] some bit operations. Hence, the code would be littered with casts to byte, that add nothing to understanding.
That's mostly because the Java Virtual Machine's operand stack hates byte, but loves int. The stack uses 32-bit slots, which matches the size of an int.
You will notice, however, that Java *does* like byte[] references. But that's because the contents of arrays are stored on the heap (not the stack). And whenever a specific byte element is loaded onto the stack (the baload opcode), it is immediately sign-extended to an int.
But sometimes Java actually needs 257(!) distinct values: InputStream#read() can return any of the 256 byte values, and when the stream has no more content it returns -1. Alternatively, it could have thrown an EOFException (like some other methods do), but exceptions are slow in Java.
Even though you don't need the -1 value for OutputStream#write, using int keeps the two APIs coherent and reduces casting. But yes, it's misleading too.
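The standard read loop shows why that extra value is needed; a minimal sketch (in can be any InputStream, and IOException handling is omitted):
int next;
while ((next = in.read()) != -1) {   // -1 is reserved to mean "end of stream"
    byte b = (byte) next;            // all other results fit in the low 8 bits
    // ... process b ...
}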
Is it possible to index a Java array based on a byte?
i.e. something like
array[byte b] = x;
I have a very performance-critical application which reads b (in the code above) from a file, and I don't want the overhead of converting this to an int. What is the best way to achieve this? Is there a performance-decrease as a result of using this method of indexing rather than an int?
With many thanks,
Froskoy.
There's no overhead for "converting this to an int." At the Java bytecode level, all bytes are already ints.
In any event, doing array indexing will automatically upcast to an int anyway. None of these things will improve performance, and many will decrease performance. Just leave your code using an int.
The JVM specification, section 2.11.1:
Note that most instructions in Table 2.2 do not have forms for the integral types byte, char, and short. None have forms for the boolean type. Compilers encode loads of literal values of types byte and short using Java virtual machine instructions that sign-extend those values to values of type int at compile-time or runtime. Loads of literal values of types boolean and char are encoded using instructions that zero-extend the literal to a value of type int at compile-time or runtime. Likewise, loads from arrays of values of type boolean, byte, short, and char are encoded using Java virtual machine instructions that sign-extend or zero-extend the values to values of type int. Thus, most operations on values of actual types boolean, byte, char, and short are correctly performed by instructions operating on values of computational type int.
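In other words, the byte index is widened for you, but the widening sign-extends, so a negative byte becomes a negative index; a small sketch (values are illustrative):
int[] table = new int[256];
byte small = 100;
table[small] = 1;            // fine: 100 is promoted to the int index 100
byte big = (byte) 200;       // bit pattern 0xC8, reads as -56 when signed
// table[big] = 1;           // would throw ArrayIndexOutOfBoundsException (index -56)
table[big & 0xFF] = 1;       // mask first to get the intended index 200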
As all integer types in Java are signed, you have to mask out the low 8 bits of b's value anyway, provided you expect to read values greater than 0x7F from the file:
byte b;
byte[] a = new byte[256];
a[b & 0xFF] = x;
No; array indices are non-negative integers (JLS 10.4), but byte indices will be promoted.
No, there is no performance decrease, because the moment you read the byte it ends up in a CPU register anyway. Those registers always work with words, which means the byte is always "converted" to an int (or a long, if you are on a 64-bit machine).
So, simply read your byte like this:
int b = (in.readByte() & 0xFF);
If your application is that performance critical, you should be optimizing elsewhere.
I have often heard complaints against Java for not having unsigned data types. See for example this comment. I would like to know how is this a problem? I have been programming in Java for 10 years more or less and never had issues with it. Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Since unsigned and signed numbers are represented with the same bit values, the only places I can think of where signedness matters are:
When converting the numbers to other bit representation. Between 8, 16 and 32 bit integer types you can use bitmasks if needed.
When converting numbers to decimal format, usually to Strings.
Interoperating with non-Java systems through API's or protocols. Again the data is just bits, so I don't see the problem here.
Using the numbers as memory or other offsets. With 32-bit ints this might be a problem for very large offsets.
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those. What am I missing? What are the actual benefits of having unsigned types in a programming language and how would having those make Java better?
Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Why not? Is "applying a bitwise AND with 0xFF" actually part of what your code is trying to represent? If not, why should it have to be part of how you write it? I actually find that almost anything I want to do with bytes beyond just copying them from one place to another ends up requiring a mask. I want my code to be cruft-free; the lack of unsigned bytes hampers this :(
Additionally, consider an API which will always return a non-negative value, or only accepts non-negative values. Using an unsigned type allows you to express that clearly, without any need for validation. Personally I think it's a shame that unsigned types aren't used more in .NET, e.g. for things like String.Length, ICollection.Count etc. It's very common for a value to naturally only be non-negative.
Is the lack of unsigned types in Java a fatal flaw? Clearly not. Is it an annoyance? Absolutely.
The comment that you quote hits the nail on the head:
Java's lack of unsigned data types also stands against it. Yes, you can work around it, but it's not ideal and you'll be using code that doesn't really reflect the underlying data correctly.
Suppose you are interoperating with another system, which wants an unsigned 16 bit integer, and you want to represent the number 65535. You claim "the data is just bits, so I don't see the problem here" - but having to pass -1 to mean 65535 is a problem. Any impedance mismatch between the representation of your data and its underlying meaning introduces an extra speedbump when writing, reading and testing the code.
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those.
The only times you would need to consider those operations is when you were naturally working with values of two different types - one signed and one unsigned. At that point, you absolutely want to have that difference pointed out. With signed types being used to represent naturally unsigned values, you should still be considering the differences, but the fact that you should is hidden from you. Consider:
// This should be considered unsigned - so a value of -1 is "really" 65535
short length = /* some value */;
// This is really signed
short foo = /* some value */;
boolean result = foo < length;
Suppose foo is 100 and length is -1. What's the logical result? The value of length represents 65535, so logically foo is smaller than it. But you'd probably go along with the code above and get the wrong result.
Of course they don't even need to represent different types here. They could both be naturally unsigned values, represented as signed values with negative numbers being logically greater than positive ones. The same error applies, and wouldn't be a problem if you had unsigned types in the language.
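If you are on Java 8 or later, the least bad fix is to make the intended interpretation explicit at the comparison; a small sketch, assuming both shorts are "really" unsigned 16-bit values:
boolean correct = Short.toUnsignedInt(foo) < Short.toUnsignedInt(length);
// foo == 100, length == -1 (i.e. 65535 unsigned): correct is true, as intended.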
You might also want to read this interview with Joshua Bloch (Google cache, as I believe it's gone from java.sun.com now), including:
Ooh, good question... I'm going to say that the strangest thing about the Java platform is that the byte type is signed. I've never heard an explanation for this. It's quite counterintuitive and causes all sorts of errors.
If you like, yes, everything is ones and zeroes. However, your hardware arithmetic and logic unit doesn't work that way. If you want to store your bits in a signed integer value but perform operations that are not natural to signed integers, you will usually waste both storage space and processing time.
An unsigned integer type stores twice as many non-negative values in the same space as the corresponding signed integer type. So if you want to bring into Java data commonly handled as unsigned in another language, such as a POSIX date value (an unsigned number of seconds) as normally used in C, then in general you will need a wider integer type than C would use. If you are processing many such values, again you will waste both storage space and fetch-execute time.
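For example, a 32-bit unsigned POSIX timestamp has to be widened to a long on the Java side; a minimal sketch (buf is assumed to be a ByteBuffer wrapping the bytes written by the C code):
int rawSeconds = buf.getInt();                               // the 32 bits exactly as C wrote them
long secondsSinceEpoch = Integer.toUnsignedLong(rawSeconds); // 0..4294967295, no sign surprises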
The times I have used unsigned data types have been when I read in large blocks of data that correspond to images, or worked with openGL. I personally prefer unsigned if I know something will never be negative, as a "safety feature" of sorts.
Unsigned types are useful for bit-by-bit comparisons, and I'm pretty sure they are used extensively in graphics.
I need to convert a piece of code from Objective-C to Java, but I have a problem understanding the primitive types in Objective-C. I have these data types in my Objective-C code:
UInt64, UInt32, UInt8,
which are unsigned integers (as I understand from the internet). So my question is, can I use Java primitive types like byte (8-bit) instead of UInt8, int (32-bit) instead of UInt32, and long (64-bit) instead of UInt64?
Unfortunately, it isn't a straight translation, and without knowing more about your program, it's hard to suggest what the "right" approach is.
If your UInt8 values really range from 0 to 255, you may have to use a Java signed int to be able to hold the entire range.
If you are dealing with byte streams or memory layouts and really need to use just a single byte of memory, then you could try byte, but you may have to test and handle the cases where the high bit is set (value > 127). Ditto with the other unsigned types.
Ideally, if your code just kind of "defaulted" to the unsigned types, but really the signed versions would have worked fine too (i.e. the ranges of your values never equal or exceed 2^7, 2^31, or 2^63 respectively), then you may be fine with the "straight" translation to byte, int, and long.
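If the values can exceed the signed ranges, a rough sketch of the widening you would do on the Java side (in is assumed to be a DataInputStream over the serialized Objective-C data; the names are illustrative):
int u8   = in.readByte() & 0xFF;          // UInt8: full 0..255 range
long u32 = in.readInt() & 0xFFFFFFFFL;    // UInt32: full 0..4294967295 range
long u64 = in.readLong();                 // UInt64: same bits; compare/divide via Long.compareUnsigned etc.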
Yes, those are the correctly sized data types to use in Java. Make sure you take into account that Java does not have unsigned types and the trick is to use the next largest size. 64 bit unsigned arithmetic requires special consideration.