How should one properly handle a conversion rounding error with BigDecimal in Java:
BigDecimal -> byte[] -> BigDecimal
I have a custom datatype 32 bytes in length (yep, 32 bytes, not 32 bits) and I need to encode the fractional part of a BigDecimal into a byte[].
I understand that I will lose some accuracy. Are there any established techniques to implement such a conversion?
NOTE:
It is a fixed-point datatype of the form MxN, where M % 8 == N % 8 == 0.
Your fixed-point fractional part can be interpreted as the numerator, n, of a fraction n/(2^256). I suggest, therefore, computing the BigDecimal value representing 1/(2^256) (this is exactly representable as a BigDecimal) and storing a reference to it in a final static field.
To convert to a byte[], then, use the two-arg version of BigDecimal.divideToIntegralValue() to divide the fractional part of your starting number by 1/(2^256), using the MathContext argument to specify the rounding mode you want. Presumably you want either RoundingMode.HALF_EVEN or RoundingMode.HALF_UP. Then get the BigInteger unscaled value of the result (which should be numerically equal to the scaled value, since an integral value should have scale 0) via BigDecimal.unscaledValue(). BigInteger.toByteArray() will then give you a byte[] closely related to what you're after.*
To go the other way, you can pretty much reverse the process. BigInteger has a constructor that accepts a byte[] that, again, is very closely related to your representation. Using that constructor, convert your byte[] to a BigInteger, and thence to a BigDecimal via the appropriate constructor. Multiply by that stored 1/(2^256) value to get the fractional part you want.
* The biggest trick here may involve twiddling signs appropriately. If your BigDecimals may be negative, then you probably want to first obtain their absolute values before converting to byte[]. More importantly, the byte[]s produced and consumed by BigInteger use a two's complement representation (i.e. with a sign bit), whereas I suppose you'll want an unsigned, pure binary representation. That mainly means that you'll need to allow for an extra bit -- and therefore a whole extra byte -- when you convert. Also be aware of byte order; check the docs of BigInteger for the byte order it uses, and adjust as appropriate.
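For concreteness, here is a minimal sketch of the round trip, assuming all 32 bytes are fractional (N = 256) and that values are non-negative and in [0, 1); the class and method names are hypothetical, and the division by 1/(2^256) is written as a multiplication by 2^256 with an explicit rounding step, which is equivalent:

import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class Fraction256 {
    // 2^256 as a BigDecimal; dividing by 1/(2^256) equals multiplying by this.
    private static final BigDecimal TWO_POW_256 =
            new BigDecimal(BigInteger.ONE.shiftLeft(256));

    // Encodes a fraction in [0, 1) as a 32-byte unsigned big-endian array.
    static byte[] encode(BigDecimal fraction) {
        BigInteger n = fraction.multiply(TWO_POW_256)
                .setScale(0, RoundingMode.HALF_EVEN) // round to an integral numerator
                .unscaledValue();                    // scale is 0, so this is the value itself
        byte[] raw = n.toByteArray();                // big-endian; may carry a leading sign byte
        byte[] out = new byte[32];
        // Right-align into 32 bytes, dropping a leading zero sign byte if present.
        int len = Math.min(raw.length, 32);
        System.arraycopy(raw, raw.length - len, out, 32 - len, len);
        return out;
    }

    // Decodes a 32-byte unsigned big-endian array back into a fraction in [0, 1).
    static BigDecimal decode(byte[] bytes) {
        BigInteger n = new BigInteger(1, bytes);      // signum 1: interpret bytes as unsigned
        return new BigDecimal(n).divide(TWO_POW_256); // exact: dividing by a power of 2 always terminates in decimal
    }
}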
Related
I have 2 integer values stored in a ByteBuffer, in little-endian format. These integers are actually the 32-bit halves of a long. I have to store them as a class's member variables, loBits and hiBits.
This is what I did:
long loBits = buffer.getInt(offset);
long hiBits = buffer.getInt(offset + Integer.BYTES);
I want to know why directly assigning signed int to long is wrong. I kind of know what's going on, but would really appreciate an explanation.
The int I read from the buffer is signed (because Java). If it is negative, then directly assigning it to a long (or casting it with (long)) changes all the higher-order bits in the long to the sign bit's value.
For example, the hex representation of the int -1684168480 is 9b9da0e0. If I assign this int to a long, the upper 32 bits all become 1 (F in hex).
int negativeIntValue = -1684168480;
long val1 = negativeIntValue;
long val2 = (long) negativeIntValue;
Hex representation of:
negativeIntValue is 0x9b9da0e0
val1 is 0xffffffff9b9da0e0
val2 is 0xffffffff9b9da0e0
However, if I mask the negativeIntValue with 0x00000000FFFFFFFFL, I get a long which has the same hex representation as negativeIntValue and a positive long value of 2610798816.
So my questions are:
Is my understanding correct?
Why does this happen?
Yes, your understanding is correct (at least if I understood your understanding correctly).
The reason this happens is that (most) computers use 2's complement to store signed values. So when assigning a smaller datatype to a larger one, the value is sign-extended, meaning that the excess part of the datatype is filled with 0 or 1 bits depending on whether the original value was positive or negative.
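For the original snippet, masking the read value (or using Integer.toUnsignedLong) keeps the upper 32 bits zero. A sketch, where the helper name toLong is hypothetical:

import java.nio.ByteBuffer;

static long toLong(ByteBuffer buffer, int offset) {
    // The 0xFFFFFFFFL mask zeroes the upper 32 bits instead of letting
    // sign extension fill them; Integer.toUnsignedLong(...) does the same.
    long loBits = buffer.getInt(offset) & 0xFFFFFFFFL;
    long hiBits = buffer.getInt(offset + Integer.BYTES) & 0xFFFFFFFFL;
    return (hiBits << 32) | loBits;
}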
Also related is the difference between >> and >>> operators in Java. The first one performs sign extending (keeping negative values negative) the second one does not (shifting a negative value makes it positive).
The reason for this is that negative values are stored as two's complement.
Why do we use two's complement?
In a fixed-width numbering system, what happens if you subtract 1 from 0?
0000b - 0001b -> 1111b
and what is the next number below 0? It is -1.
Therefore we treat a binary number with all bits set (for a signed datatype) as -1.
The big advantage is that the CPU does not need to do any special operation when changing from positive to negative numbers. It handles 5 - 3 the same way as 3 - 5.
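Both effects are easy to observe directly (a quick sketch):

System.out.println(Integer.toBinaryString(-1)); // 32 ones: all bits set means -1
System.out.println(5 - 3);                      // 2
System.out.println(3 - 5);                      // -2, stored as 0xFFFFFFFE
System.out.println(-8 >> 1);                    // -4: >> copies the sign bit in from the left
System.out.println(-8 >>> 1);                   // 2147483644: >>> shifts in zeros, so the result is positive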
When I do
Long.parseUnsignedLong("FBD626CC4961A4FC", 16)
I get back -300009666327239428
Which seems wrong, since according to this answer https://stackoverflow.com/a/2550367/1754020 the range of an unsigned long is always positive.
To get the correct number from this HEX value I do
BigInteger value = new BigInteger("FBD626CC4961A4FC", 16);
When I print value it prints the correct value, but if I do value.longValue() I again get the same -300009666327239428. Is this because of the number being too big and overflowing?
Java 8 does (somewhat) support unsigned longs; however, you can't just print them directly. Doing so will give you the result that you saw.
If you have an unsigned long
Long number = Long.parseUnsignedLong("FBD626CC4961A4FC", 16);
you can get the correct string representation with the function
String numberToPrint = Long.toUnsignedString(number);
If you now print numberToPrint you get
18146734407382312188
To be more exact, your number is still going to be a regular signed long, which is why it shows the overflowed value if printed directly. However, there are new static functions that treat the value as if it were unsigned, such as Long.toUnsignedString(long x) or Long.compareUnsigned(long x, long y).
The hexadecimal number "FBD626CC4961A4FC", converted to decimal, is exactly 18146734407382312188. That number is indeed larger than the maximum possible long, defined as Long.MAX_VALUE, which is equal to 2^63 - 1, or 9223372036854775807:
System.out.println(new BigInteger("FBD626CC4961A4FC", 16)); // 18146734407382312188
System.out.println(Long.MAX_VALUE); // 9223372036854775807
As such, it's normal that you get back a negative number.
You do not get an exception, as it is exactly the purpose of those new *Unsigned* methods added in Java 8 to give the ability to handle unsigned longs (like compareUnsigned or divideUnsigned). Since the type long in Java is still signed, those methods work by understanding negative values as values greater than MAX_VALUE: they simulate an unsigned long. parseUnsignedLong says:
An unsigned integer maps the values usually associated with negative numbers to positive numbers larger than MAX_VALUE.
If you print a long that was the result of parseUnsignedLong and it is negative, all it means is that the value is greater than the max long value as defined by the language; methods taking unsigned longs as parameters will still correctly interpret those values as greater than the max value. So instead of printing it directly, if you pass that number to toUnsignedString, you'll get the right output, as shown in this other answer. Not all of these methods are new to Java 8; for example, toHexString also interprets the given long as an unsigned long in base 16, and printing Long.toHexString(Long.parseUnsignedLong("FBD626CC4961A4FC", 16)) will give you back the right hex String.
parseUnsignedLong will throw an exception only when the value cannot be represented as an unsigned long, i.e. not a number at all, or greater than 2^64 - 1 (and not 2^63 - 1, which is the maximum value for a signed long).
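For instance, the same 64 bits can be read both ways (a quick sketch):

long n = Long.parseUnsignedLong("FBD626CC4961A4FC", 16);
System.out.println(n);                        // -300009666327239428 (signed view)
System.out.println(Long.toUnsignedString(n)); // 18146734407382312188 (unsigned view)
System.out.println(Long.toHexString(n));      // fbd626cc4961a4fc
System.out.println(Long.compareUnsigned(n, Long.MAX_VALUE)); // > 0: n exceeds 2^63 - 1 when compared as unsigned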
Yes, it overflows when you try to print it, as it is converted to the Java long type. To understand why, let's take the log2 of your decimal value.
First, the original value is 18146734407382312188. Its log2 is ~63.9763437545.
Second, look into the documentation: in Java, the long type represents values from a minimum of -2^63 to a maximum of 2^63 - 1.
So, your value is obviously greater than 2^63 - 1, hence it overflows (wraps around by 2^64):
-2^63 + (18146734407382312188 - 2^63) = 18146734407382312188 - 2^64 = -300009666327239428
But as #Keiwan brilliantly mentioned, you can still print the proper value using Long.toUnsignedString(number);
Internally, unsigned and signed numbers are represented in the same way, i.e. as 8 bytes in the case of a long. The difference is only in how the "sign" bit is interpreted; if you did the same in a C/C++ program and stored your value into a uint64_t, then cast/mapped it to a signed int64_t, you would get the same result.
Since the maximum value 8 bytes (64 bits) can hold is 2^64 - 1, that's the hard constraint for such numbers. Also, Java doesn't directly support unsigned numbers, and thus the only way to store an unsigned long in a long is to allow for a value that's higher than the signed Long.MAX_VALUE. In fact, Java doesn't know whether the string/hex code you're reading is meant to represent a signed or unsigned long, so it's up to you to provide that interpretation, either by converting back to a string or by using a larger datatype such as BigInteger.
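One way to check the wraparound arithmetic is with BigInteger (a small sketch):

import java.math.BigInteger;

BigInteger unsigned = new BigInteger("FBD626CC4961A4FC", 16);
BigInteger wrapped  = unsigned.subtract(BigInteger.ONE.shiftLeft(64)); // subtract 2^64
System.out.println(wrapped);                                     // -300009666327239428
System.out.println(unsigned.longValue() == wrapped.longValue()); // true: same low 64 bits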
Question:
This code in Java:
BigInteger mod = new BigInteger("86f71688cdd2612ca117d1f54bdae029", 16);
produces (in java) the number
179399505810976971998364784462504058921
However, when I use C#,
BigInteger mod = BigInteger.Parse("86f71688cdd2612ca117d1f54bdae029", System.Globalization.NumberStyles.HexNumber); // base 16
I don't get the same number, I get:
-160882861109961491465009822969264152535
However, when I create the number directly from decimal, it works
BigInteger mod = BigInteger.Parse("179399505810976971998364784462504058921");
I tried converting the hex string into a byte array and reversing it, and creating a BigInteger from the reversed array, just in case it's a byte array with different endianness, but that didn't help...
I also encountered the following problem when converting Java-Code to C#:
Java
BigInteger k0 = new BigInteger(byte[]);
to get the same number in C#, I must reverse the array because of the different endianness of the BigInteger implementations.
C# equivalent:
BigInteger k0 = new BigInteger(byte[].Reverse().ToArray());
Here's what MSDN says about BigInteger.Parse:
If value is a hexadecimal string, the Parse(String, NumberStyles) method interprets value as a negative number stored by using two's complement representation if its first two hexadecimal digits are greater than or equal to 0x80. In other words, the method interprets the highest-order bit of the first byte in value as the sign bit. To make sure that a hexadecimal string is correctly interpreted as a positive number, the first digit in value must have a value of zero. For example, the method interprets 0x80 as a negative value, but it interprets either 0x080 or 0x0080 as a positive value.
So, add a 0 in front of the parsed hexadecimal number to force an unsigned interpretation.
As for round-tripping a big integer represented by a byte array between Java and C#, I'd advise against that, unless you really have to. But both implementations happen to use a compatible two's complement representation, if you fix the endianness issue.
MSDN says:
The individual bytes in the array returned by this method appear in little-endian order. That is, the lower-order bytes of the value precede the higher-order bytes. The first byte of the array reflects the first eight bits of the BigInteger value, the second byte reflects the next eight bits, and so on.
Java docs say:
Returns a byte array containing the two's-complement representation of this BigInteger. The byte array will be in big-endian byte-order: the most significant byte is in the zeroth element.
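If you do need to round-trip, the Java side of the endianness fix might look like this sketch (the helper name toDotNetBytes is hypothetical):

import java.math.BigInteger;

// Converts Java's big-endian two's-complement bytes into the
// little-endian order that .NET's BigInteger constructor expects.
static byte[] toDotNetBytes(BigInteger value) {
    byte[] bytes = value.toByteArray(); // big-endian two's complement
    for (int i = 0, j = bytes.length - 1; i < j; i++, j--) {
        byte tmp = bytes[i];
        bytes[i] = bytes[j];
        bytes[j] = tmp;
    }
    return bytes;
}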
Please note I am NOT looking for code to cast or narrow a double to int.
As per JLS - $ 5.1.3 Narrowing Primitive Conversion
A narrowing conversion of a signed integer to an integral type T
simply discards all but the n lowest order bits, where n is the number
of bits used to represent type T.
So, when I try to narrow 260 (binary representation 100000100) to a byte, the result is 4, because the lowest 8 bits are 00000100, which is decimal 4; likewise, narrowing the long value 4294967296L (binary representation 1 followed by 32 zeros) to a byte gives 0.
Now, the reason I want to know the rule for narrowing from double to int, byte, etc. is that when I narrow the double value 4294967296.0 the result is 2147483647, but when I narrow the long value 4294967296L the result is 0.
I have understood the long narrowing to int, byte, etc. (discards all but the n lowest order bits), but I want to know what is going on under the hood in the case of double narrowing.
... I want to understand the why and how part.
The JLS (JLS 5.1.3) specifies what the result is. A simplified version (for int) is:
a NaN becomes zero
an Inf becomes "max-int" or "min-int"
otherwise:
round towards zero to get a mathematical integer
if the rounded number is too big for an int, the result becomes "min-int" or "max-int"
"How" is implementation specific. For examples of how it could be implemented, look at the Hotspot source code (OpenJDK version) or get the JIT compiler to dump some native code for you to look at. (I imagine that the native code uses a single instruction to do the actual conversion ... but I haven't checked.)
"Why" is unknowable ... unless you can ask one of the original Java designers / spec authors. A plausible explanation is a combination of:
it is easy to understand
it is consistent with C / C++,
it can be implemented efficiently on common hardware platforms, and
it is better than (hypothetical) alternatives that the designers considered.
(For example, throwing an exception for NaN, Inf, out-of-range would be inconsistent with other primitive conversions, and could be more expensive to implement.)
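Each of the rules listed above is easy to observe directly (a quick sketch):

System.out.println((int) Double.NaN);               // 0
System.out.println((int) Double.POSITIVE_INFINITY); // 2147483647 (Integer.MAX_VALUE)
System.out.println((int) Double.NEGATIVE_INFINITY); // -2147483648 (Integer.MIN_VALUE)
System.out.println((int) -2.7);                     // -2: rounds toward zero
System.out.println((int) 4294967296.0);             // 2147483647: too big, clamped to max-int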
The result is Integer.MAX_VALUE when converting a double to an int and the value exceeds the range of an int. Integer.MAX_VALUE is 2^31 - 1.
When you start with the double value 4294967296.0, it is greater than the greatest int value, which is 2147483647, so the following rule is applied (from the page you cited): "The value must be too large (a positive value of large magnitude or positive infinity), and the result of the first step is the largest representable value of type int or long", and you get 0x7FFFFFFF = 2147483647.
But when you try to convert 4294967296L = 0x100000000, you start from an integral type, so the rule is: "A narrowing conversion of a signed integer to an integral type T simply discards all but the n lowest order bits"; for byte, n is 8, and the lowest 8 bits of 0x100000000 are all zero, so you just get 0.
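The contrast is visible directly (a quick sketch):

System.out.println((byte) 260);         // 4: keeps only the low 8 bits (00000100)
System.out.println((byte) 4294967296L); // 0: the low 8 bits of 2^32 are all zero
System.out.println((int) 4294967296L);  // 0: the low 32 bits are all zero too
System.out.println((int) 4294967296.0); // 2147483647: doubles are clamped, not bit-truncated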
I don't understand how an int 63823, takes up less space than a double 1.0. Is there not more information stored in the int, in this particular instance?
Good question. What you're seeing when you see 63823 and 1.0 is a representation of the underlying data, you are not seeing the underlying data. It is specially formatted so that you can read it, but it is not how the machine sees it.
Java uses very special formats for representing int and double. You need to look at those representations to understand why 63823 takes thirty-two bits when represented as a Java int and 1.0 takes sixty-four bits when represented as a Java double.
In particular, 63823 as an int in Java is represented as:
00000000000000001111100101001111
and 1.0 as a double is represented in Java as:
0011111111110000000000000000000000000000000000000000000000000000
If you want to explore more, I recommend Two's Complement and What Every Computer Scientist Should Know About Floating-Point Arithmetic.
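You can inspect those bit patterns yourself, e.g. with Double.doubleToLongBits (a small sketch; note that toBinaryString drops leading zeros even though the value still occupies its full width):

System.out.println(Integer.toBinaryString(63823));
// 1111100101001111 (16 bits shown of the int's 32)
System.out.println(Long.toBinaryString(Double.doubleToLongBits(1.0)));
// ten 1s followed by 52 zeros (62 bits shown of the double's 64)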
Not exactly. The double 1.0 represents more information because, by the definition of a double as a 64 bit float, there are more values that it could be. To use your example, if you had a special data type that could only have two values, 63823 and 98321234213474932, then it would only take 1 bit to represent the number 63823, though it would be far less useful than an int.
In terms of implementation, it's often a lot easier and faster to work with fixed-size data types, so that you can allocate a fixed chunk of memory (that's what a variable is) without having to know its value and constantly reallocate space. Examples of variables with a different approach would be String and BigInteger, which do allocate space to accommodate their values. Note that both are immutable in Java -- that's not a coincidence.
These primitive datatypes need to be defined somewhere for you to use them. It is not a flexible container where you can stuff in whatever you want, rather more like a bottle which takes the same space no matter if full or empty. And they also have a maximum they can contain.
The zeros that are not shown also count. Approximately, ignoring the fact that the numbers are actually stored in binary and not in decimal, when you write both numbers with the implied zero digits included, you get:
1.0 = 1.00000000000000000*10^0000
63823 = 0000063823
As you can see, 1.0 is twice as long as 63823. Therefore it requires twice as much storage.
The int and double don't have decimal digits at all. The decimal representation of the int has 5 decimal digits after removing leading zeros. The int itself has room for 32 binary digits. The double has room for 53 binary digits in the mantissa (52 stored plus an implicit leading bit), an 11-bit exponent, and a sign bit.
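That 53-bit mantissa is also why doubles stop representing every integer exactly past 2^53; a quick check:

long p = 1L << 53;                                  // 2^53
System.out.println((double) p == (double) (p + 1)); // true: 2^53 + 1 needs a 54th bit
System.out.println((double) p == (double) (p - 1)); // false: 2^53 - 1 still fits exactly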