How to implement Karatsuba multiplication using bit manipulation - java

I'm implementing Karatsuba multiplication in Scala (my choice) for an online course. Since the algorithm is meant to multiply large numbers, I chose the BigInt type, which is backed by Java's BigInteger. I'd like to implement the algorithm efficiently. The pseudocode below, copied from Wikipedia, uses base-10 arithmetic:
procedure karatsuba(num1, num2)
    if (num1 < 10) or (num2 < 10)
        return num1 * num2
    /* calculates the size of the numbers */
    m = max(size_base10(num1), size_base10(num2))
    m2 = floor(m / 2)
    /* split the digit sequences in the middle */
    high1, low1 = split_at(num1, m2)
    high2, low2 = split_at(num2, m2)
    /* 3 calls made to numbers approximately half the size */
    z0 = karatsuba(low1, low2)
    z1 = karatsuba((low1 + high1), (low2 + high2))
    z2 = karatsuba(high1, high2)
    return (z2 * 10^(m2 * 2)) + ((z1 - z2 - z0) * 10^m2) + z0
Given that BigInteger is internally represented as an int[], if I can calculate m2 in terms of the int[], I can use bit shifting to extract the lower and higher halves of the number. Similarly, the last step can be achieved by bit shifting too.
However, it's easier said than done, as I can't seem to wrap my head around the logic. For example, if the max number is 999, its binary representation is 1111100111, the lower half is 99 = 1100011, and the upper half is 9 = 1001. How do I get the above split?
Note:
There is an existing question that shows how to implement using arithmetic on BigInteger, but not bit shifting. Hence, my question is not a duplicate.

To be able to use bit shifting to do the splits and recombination, the base needs to be a power of two. Using two itself, as in the linked answer, is probably reasonable. Then the "length" of the inputs can be found directly with bitLength, and the split could be implemented as:
// x = a + 2^N b
BigInteger b = x.shiftRight(N);
BigInteger a = x.subtract(b.shiftLeft(N));
Where N is the size that a will have in bits.
Given that BigInteger is implemented with 32-bit limbs, it makes sense to use 2³² as the base, ensuring that the big shifts involve only the movement of whole integers and not the slower code path where the BigInteger is shifted by a value between 1 and 31. This can be accomplished by rounding N down to a multiple of 32.
The specific constant in this line,
if (N <= 2000) return x.multiply(y); // optimize this parameter
should probably not be trusted too much, given that comment. For performance there should be some bound though, otherwise the recursive splitting goes too deep. For example, when the size of the numbers is 32 bits or less, it's clearly better to just multiply, but a good cut-off is probably much higher. In the source of BigInteger itself, the cutoff is expressed in terms of the number of limbs instead of bits and is set to 80 (so 2560 bits); it also has another threshold above which it switches to 3-way Toom-Cook multiplication instead of Karatsuba multiplication.
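Putting those pieces together, a minimal sketch of the idea (my own illustration using java.math.BigInteger, not tested beyond the reasoning above; the 2560-bit cutoff is only a starting point and should be tuned by measurement):
static BigInteger karatsuba(BigInteger x, BigInteger y) {
    int n = Math.max(x.bitLength(), y.bitLength());
    if (n <= 2560) return x.multiply(y);   // cutoff: plain multiplication
    // split point: about half the bits, rounded down to a multiple of 32
    int half = ((n / 2) / 32) * 32;
    // x = a + 2^half * b, y = c + 2^half * d
    BigInteger b = x.shiftRight(half);
    BigInteger a = x.subtract(b.shiftLeft(half));
    BigInteger d = y.shiftRight(half);
    BigInteger c = y.subtract(d.shiftLeft(half));
    // 3 recursive calls on numbers of roughly half the size
    BigInteger z0 = karatsuba(a, c);
    BigInteger z2 = karatsuba(b, d);
    BigInteger z1 = karatsuba(a.add(b), c.add(d));
    // x*y = z2 * 2^(2*half) + (z1 - z2 - z0) * 2^half + z0
    return z2.shiftLeft(2 * half)
             .add(z1.subtract(z2).subtract(z0).shiftLeft(half))
             .add(z0);
}
Since half is a multiple of 32, every shift moves whole limbs, matching the optimization described above.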

Related

How does Math.pow() handle fractional exponents (specifically nth roots) [duplicate]

I'm trying to determine the asymptotic run-time of one of my algorithms, which uses exponents, but I'm not sure of how exponents are calculated programmatically.
I'm specifically looking for the pow() algorithm used for double-precision, floating point numbers.
I've had a chance to look at fdlibm's implementation. The comments describe the algorithm used:
* Method: Let x = 2^n * (1+f)
*   1. Compute and return log2(x) in two pieces:
*          log2(x) = w1 + w2,
*      where w1 has 53-24 = 29 bit trailing zeros.
*   2. Perform y*log2(x) = n+y' by simulating multi-precision
*      arithmetic, where |y'| <= 0.5.
*   3. Return x**y = 2**n * exp(y' * log2)
followed by a listing of all the special cases handled (0, 1, inf, nan).
The most intensive sections of the code, after all the special-case handling, involve the log2 and 2** calculations. And there are no loops in either of those. So, the complexity of floating-point primitives notwithstanding, it looks like an asymptotically constant-time algorithm.
Floating-point experts (of which I'm not one) are welcome to comment. :-)
Unless they've discovered a better way to do it, I believe that approximate values for trig, logarithmic, and exponential functions (for exponential growth and decay, for example) are generally calculated using arithmetic rules and Taylor series expansions to produce an approximate result accurate to within the requested precision. (See any calculus book for details on power series, Taylor series, and Maclaurin series expansions of functions.) Please note that it's been a while since I did any of this, so I couldn't tell you, for example, exactly how to calculate the number of terms in the series you need to include to guarantee an error small enough to be negligible in a double-precision calculation.
For example, the Taylor/Maclaurin series expansion for e^x is this:
e^x = SUM[k=0..+inf] (x^k / k!) = 1 + x + x^2/(2*1) + x^3/(3*2*1) + x^4/(4*3*2*1) + x^5/(5*4*3*2*1) + ...
If you take all of the terms (k from 0 to infinity), this expansion is exact and complete (no error).
However, if you don't take all the terms going to infinity, but stop after, say, 5 terms or 50 terms or whatever, you produce an approximate result that differs from the actual e^x function value by a remainder which is fairly easy to calculate.
The good news for exponentials is that the series converges nicely and the terms of its polynomial expansion are fairly easy to code iteratively, so you might (repeat, MIGHT; remember, it's been a while) not even need to pre-calculate how many terms you need to guarantee your error is less than precision, because you can test the size of the contribution at each iteration and stop when it becomes close enough to zero. In practice, I do not know if this strategy is viable or not; I'd have to try it. There are important details I have long since forgotten about, stuff like machine precision, machine error, rounding error, etc.
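As a rough illustration of that stop-when-small strategy, here is a hedged sketch of my own (not a library algorithm; it ignores range reduction, so accuracy degrades for large |x|):
// Sum the Maclaurin series for e^x, stopping once a term no longer
// changes the running sum at double precision.
static double expTaylor(double x) {
    double sum = 1.0;   // the k = 0 term
    double term = 1.0;
    for (int k = 1; k < 1000; k++) {
        term *= x / k;              // builds x^k / k! incrementally
        double next = sum + term;
        if (next == sum) break;     // contribution vanished at this precision
        sum = next;
    }
    return sum;
}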
Also, please note that if you are not using e^x, but you are doing growth/decay with another base like 2^x or 10^x, the approximating polynomial function changes.
The usual approach to raising a to the b, for an integer exponent, goes something like this:
result = 1
while b > 0
    if b is odd
        result *= a
        b -= 1
    b /= 2
    a = a * a
It is generally logarithmic in the size of the exponent. The algorithm is based on the invariant "a^b * result = a0^b0", where a0 and b0 are the initial values of a and b.
For negative or non-integer exponents, logarithms and approximations and numerical analysis are needed. The running time will depend on the algorithm used and what precision the library is tuned for.
Edit: Since there seems to be some interest, here's a version without the extra multiplication.
result = 1
while b > 0
    while b is even
        a = a * a
        b = b / 2
    result = result * a
    b = b - 1
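Translated into Java, a straightforward sketch of the pseudocode above (my own rendering; no overflow checking, so it silently wraps for large results):
// Exponentiation by squaring for a non-negative integer exponent.
static long pow(long a, long b) {
    long result = 1;
    while (b > 0) {
        while ((b & 1) == 0) {  // b is even
            a = a * a;
            b >>= 1;
        }
        result *= a;            // b is odd here
        b -= 1;
    }
    return result;
}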
You can use exp(n*ln(x)) for calculating x^n. Both x and n can be double-precision floating-point numbers. The natural logarithm and the exponential function can be calculated using Taylor series. Here you can find the formulas: http://en.wikipedia.org/wiki/Taylor_series
If I were writing a pow function targeting Intel, I would return exp2(log2(x) * y). Intel's microcode for log2 is surely faster than anything I'd be able to code, even if I could remember my first year calculus and grad school numerical analysis.
e^x = (1 + fraction) * (2^exponent), 1 <= 1 + fraction < 2
x * log2(e) = log2(1 + fraction) + exponent, 0 <= log2(1 + fraction) < 1
exponent = floor(x * log2(e))
1 + fraction = 2^(x * log2(e) - exponent) = e^((x * log2(e) - exponent) * ln2) = e^(x - exponent * ln2), 0 <= x - exponent * ln2 < ln2
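Those identities translate into code roughly as follows (a sketch of my own under the identities above, ignoring overflow and special cases; the series loop mirrors the Taylor sketch earlier):
// e^x via range reduction: exponent = floor(x * log2(e)), evaluate the
// series on the small remainder r = x - exponent * ln(2), then scale
// by 2^exponent at the end.
static double expReduced(double x) {
    final double LOG2E = 1.4426950408889634; // log2(e)
    final double LN2   = 0.6931471805599453; // ln(2)
    int n = (int) Math.floor(x * LOG2E);     // the "exponent" above
    double r = x - n * LN2;                  // 0 <= r < ln(2)
    double sum = 1.0, term = 1.0;            // Taylor series for e^r
    for (int k = 1; k <= 30; k++) {
        term *= r / k;
        sum += term;                         // converges fast since r < 1
    }
    return Math.scalb(sum, n);               // (1 + fraction) * 2^exponent
}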

Emulating multiplication of 128-bit integers with pairs of 64-bit integers [duplicate]

I need to multiply two 8-byte (64-bit) arrays in the fastest way possible. The byte arrays are little-endian. The arrays can be wrapped in a ByteBuffer and treated as little-endian to easily resolve a Java long value that correctly represents the bytes (but not the real nominal value, since Java longs are two's complement).
Java's standard way to handle large math is BigInteger. But that implementation is slow and unnecessary, since I'm very strictly working with 64 bits x 64 bits. In addition, you can't throw the long value into one because the nominal value is incorrect, and I can't use the byte array directly because it's little-endian. I need to be able to do this without using up more memory/CPU to reverse the array. This type of multiplication should be able to execute 1M+ times per second. BigInteger doesn't really come close to meeting that requirement anyway, so I'm trying to do it via splitting the high-order bits from the low-order bits, but I can't get it working consistently.
The high-order-bits-only code is only working for a subset of longs, because even the intermediate addition can overflow. I got my current code from this answer:
high bits of long multiplication in Java?
Is there a more generic pattern for getting the hi/lo order bits of a 128-bit multiplication, one that works for the largest long values?
Edit:
FWIW I'm prepared for the answer to be "can't do that in Java, do it in C++ and call via JNI". Though I'm hoping someone can give a Java solution before it comes to that.
As of Java 9 (which was a bit too new at the time this question was asked), there is now a trivial way to get the upper half of the 128-bit product of two signed 64-bit integers: Math.multiplyHigh
There is a relatively simple conversion from "upper half of signed product" to "upper half unsigned product" (see Hacker's Delight chapter 8), which can be used to implement an unsigned multiply high like this:
static long multiplyHighUnsigned(long x, long y) {
    long signedUpperHalf = Math.multiplyHigh(x, y);
    return signedUpperHalf + ((x >> 63) & y) + ((y >> 63) & x);
}
This has the potential to be more efficient (on platforms on which multiplyHigh is treated as an intrinsic function by the JIT) than the more manual approach used by the old answer, which I will leave below the line.
It can be done manually without BigInteger by splitting the longs up into two halves, creating the partial products, and then summing them up. Naturally the low half of the sum can be left out.
The partial products overlap, like this:
  LL
 LH
 HL
HH
So the high halves of LH and HL must be added to the high result, and furthermore the low halves of LH and HL, together with the high half of LL, may carry into the bits of the high half of the result. The low half of LL is not used.
So something like this (only slightly tested):
static long hmul(long x, long y) {
    long m32 = 0xffffffffL;
    // split each operand into 32-bit halves
    long xl = x & m32;
    long xh = x >>> 32;
    long yl = y & m32;
    long yh = y >>> 32;
    // partial products
    long t00 = xl * yl;
    long t01 = xh * yl;
    long t10 = xl * yh;
    long t11 = xh * yh;
    // resolve sum and carries:
    // high halves of t10 and t01 overlap with the low half of t11
    t11 += (t10 >>> 32) + (t01 >>> 32);
    // the sum of the low halves of t10 + t01 plus
    // the high half of t00 may carry into the high half of the result
    long tc = (t10 & m32) + (t01 & m32) + (t00 >>> 32);
    t11 += tc >>> 32;
    return t11;
}
This of course treats the inputs as unsigned, which does not mean they have to be positive in the sense that Java would treat them as positive: you can absolutely input -1501598000831384712L and -735932670715772870L, and the right answer comes out, as confirmed by Wolfram Alpha.
If you are prepared to interface with native code: in C++ with MSVC you could use __umulh, and with GCC/Clang you can form the product as an __uint128_t and just shift it right; the codegen for that is actually fine and doesn't cause a full 128x128 multiply.
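As a usage note (my own addition): Java's wrapping multiplication x * y already yields the low 64 bits of the full product, so either routine above extends to a complete 128-bit multiply:
// Full 128-bit unsigned product as a (high, low) pair of longs.
static long[] fullMultiply(long x, long y) {
    long high = multiplyHighUnsigned(x, y); // or hmul(x, y) before Java 9
    long low  = x * y;                      // wrapping multiply = low 64 bits
    return new long[] { high, low };
}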

Shifting BigInteger of Java by long variable

I know there are the methods shiftLeft(int n) and shiftRight(int n) for the BigInteger class, which only take an int as an argument, but I have to shift by a long variable. Is there any method to do it?
BigInteger can have at most Integer.MAX_VALUE bits. Shifting right by more than this will always give zero (or -1 for negative values); shifting left any value but zero by that much will be an overflow.
From the Javadoc
* BigInteger constructors and operations throw {@code ArithmeticException} when
* the result is out of the supported range of
* -2<sup>{@code Integer.MAX_VALUE}</sup> (exclusive) to
* +2<sup>{@code Integer.MAX_VALUE}</sup> (exclusive).
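So a helper that accepts a long shift distance only needs to clamp it (a minimal sketch of my own using java.math.BigInteger, assuming a non-negative distance): any right shift of Integer.MAX_VALUE or more has already pushed every bit out.
// Emulate BigInteger.shiftRight with a long shift distance.
static BigInteger shiftRight(BigInteger x, long n) {
    if (n >= Integer.MAX_VALUE) {
        // all bits are gone: 0 for non-negative values, -1 for negative
        return x.signum() < 0 ? BigInteger.valueOf(-1) : BigInteger.ZERO;
    }
    return x.shiftRight((int) n);
}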
If you need more than 2 billion bits to represent your value, you have a fairly unusual problem that BigInteger wasn't designed for.
If you need to do bit manipulation on a very large scale, I suggest using a BitSet[]. This will allow up to 2 billion bit sets of up to 2 billion bits each, more than your addressable memory.
Yes, the long variable might go up to 10^10.
Each 10^10-bit number needs 1.25 GB of memory. For this size of data, you may need to store it off-heap; we have a library which persists this much data in a single memory mapping without using much heap, but you need to have at least this much space free on a single disk: https://github.com/OpenHFT/Chronicle-Bytes
BigInteger does not support values where long shift amounts would be appropriate. I tried
BigInteger a = BigInteger.valueOf(2).pow(Integer.MAX_VALUE);
and I got the following exception:
Exception in thread "main" java.lang.ArithmeticException: BigInteger would overflow supported range.
Since 2 ^ X is equal to 10 ^ (X * ln(2) / ln(10)), we can calculate for X = 10 ^ 10:
2 ^ (10 ^ 10) = 10 ^ 3,010,299,956.63981195...
= 10 ^ 3,010,299,956 * 10 ^ 0.63981195...
= 4.3632686... * 10 ^ 3,010,299,956
Meaning 4 followed by more than 3 billion more digits.
That's a very large number and will take some doing storing that to full precision.
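That digit count can be reproduced with a one-liner (my own check):
// decimal digits of 2^(10^10): 10^10 * log10(2)
System.out.println(1e10 * (Math.log(2) / Math.log(10)));  // prints ≈ 3.0102999566...E9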

Check division by 3 with binary operations?

I've read this interesting answer about "Checking if a number is divisible by 3".
Although the answer is in Java, it seems to work with other languages also.
Obviously we can do:
boolean canBeDevidedBy3 = (i % 3) == 0;
But the interesting part was this other calculation:
boolean canBeDevidedBy3 = ((int) (i * 0x55555556L >> 30) & 3) == 0;
For simplicity:
0x55555556L = "1010101010101010101010101010110"
NB: there's also another method to check it:
One can determine if an integer is divisible by 3 by counting the 1 bits at odd bit positions, multiplying this number by 2, adding the number of 1 bits at even bit positions to the result, and checking if the result is divisible by 3.
For example:
93 (base 10), which is divisible by 3, is 01011101 (base 2).
It has 1 bit in the odd places and 4 bits in the even places (place being the zero-based position of the base-2 digit).
So 2*1 + 4 = 6, which is divisible by 3.
At first I thought these two methods were related, but I couldn't find how.
Question
How does
boolean canBeDevidedBy3 = ((int) (i * 0x55555556L >> 30) & 3) == 0;
actually determine whether i % 3 == 0?
Whenever you add 3 to a number, what you do is to add binary 11. Whatever the original value of the number, this will maintain the invariant that twice the number of 1 bits at odd positions, plus the number of 1 bits at even positions, will also be divisible by 3.
You can see that in this way. Let's call the value of the above expression c. You're adding 1 to an even position and 1 to an odd position. When you add 1 to an even position, the bit you've added 1 to was either set or unset. If it was unset, you increase the value of c by 1, because you've added a new 1 in an even position. If it was previously set, you flip that bit to zero, but add a 1 in an odd position (from the carry). This means you initially decrease c by 1, but when the carry adds the 1 in the odd position, you increase c by 2, so overall you've increased c by 1.
Of course, the carry bit might itself get added to a bit that's already set, in which case we need to check that the carry still effectively increases c by 2: you remove a 1 from an odd position (decreasing c by 2) and then add a 1 in the next even position (increasing c by 1), meaning that you've in fact decreased c by 1. That is the same as increasing c by 2, though, if we're working modulo 3; and if that even bit was also set, the same argument repeats up the carry chain. A symmetric argument shows that adding 1 to the odd position increases c by 2 modulo 3, so adding binary 11 changes c by 1 + 2 = 3, that is, by 0 modulo 3.
A more formal version of this would be structured as a proof by induction.
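To make the invariant concrete, here is a small Java sketch of my own (not from the original answer) that applies the 2*odd + even reduction repeatedly:
// Test divisibility by 3 via the bit-counting invariant described above.
static boolean divisibleBy3(int i) {
    long n = Math.abs((long) i);  // widen first so Integer.MIN_VALUE is safe
    while (n > 3) {
        int odd  = Long.bitCount(n & 0xAAAAAAAAAAAAAAAAL); // 1-bits at odd positions
        int even = Long.bitCount(n & 0x5555555555555555L); // 1-bits at even positions
        n = 2L * odd + even;      // preserves the value modulo 3
    }
    return n == 0 || n == 3;
}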
The two methods do not appear to be related. The bit-wise method seems to be related to certain methods for the efficient computation of modulo b-1 when using digit base b, known in decimal arithmetic as "casting out nines".
The multiplication-based method is directly based on the definition of division when accomplished by multiplication with the reciprocal. Letting / denote mathematical division, we have
int_quot = (int)(i / 3)
frac_quot = i / 3 - int_quot = i / 3 - (int)(i / 3)
i % 3 = 3 * frac_quot = 3 * (i / 3 - (int)(i / 3))
The fractional portion of the mathematical quotient translates directly into the remainder of integer division: If the fraction is 0, the remainder is 0, if the fraction is 1/3 the remainder is 1, if the fraction is 2/3 the remainder is 2. This means we only need to examine the fractional portion of the quotient.
Instead of dividing by 3, we can multiply by 1/3. If we perform the computation in a 32.32 fixed-point format, 1/3 corresponds to 2^32 * 1/3, which is a number between 0x55555555 and 0x55555556. For reasons that will become apparent shortly, we use the overestimate here, that is, the rounded-up result 0x55555556.
When we multiply 0x55555556 by i, the most significant 32 bits of the full 64-bit product will contain the integral portion of the quotient (int)(i * 1/3) = (int)(i / 3). We are not interested in this integral portion, so we neither compute nor store it. The lower 32 bits of the product contain one of the fractions 0/3, 1/3, 2/3, however computed with a slight error, since our value 0x55555556 is slightly larger than 1/3:
i = 1: i * 0.55555556 = 0.55555556
i = 2: i * 0.55555556 = 0.AAAAAAAC
i = 3: i * 0.55555556 = 1.00000002
i = 4: i * 0.55555556 = 1.55555558
i = 5: i * 0.55555556 = 1.AAAAAAAE
If we examine the most significant bits of the three possible fraction values in binary, we find that 0x5 = 0101, 0xA = 1010, 0x0 = 0000. So the two most significant bits of the fractional portion of the quotient correspond exactly to the desired modulo values. Since we are dealing with 32-bit operands, we can extract these two bits with a right shift by 30 bits followed by a mask of 0x3 to isolate two bits. I think the masking is needed in Java as 32-bit integers are always signed. For uint32_t operands in C/C++ the shift alone would suffice.
We now see why choosing 0x55555555 as representation of 1/3 wouldn't work. The fractional portion of the quotient would turn into 0xFFFFFFF*, and since 0xF = 1111 in binary, the modulo computation would deliver an incorrect result of 3.
Note that as i increases in magnitude, the accumulated error from the imprecise representation of 1/3 affects more and more bits of the fractional portion. In fact, exhaustive testing shows that the method only works for i < 0x60000000: beyond that limit the error overwhelms the most significant fraction bits which represent our result.
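For instance (a quick check of my own, consistent with that limit): i = 0x60000000 = 3 * 2^29 is divisible by 3, yet this is exactly where the trick first misfires:
int i = 0x60000000;                      // 3 * 2^29, so i % 3 == 0
boolean trick = ((int) (i * 0x55555556L >> 30) & 3) == 0;
System.out.println(trick);               // false: the trick breaks down here
System.out.println(i % 3 == 0);          // true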
