What data-structure should I use to create my own "BigInteger" class? - java

As an optional assignment, I'm thinking about writing my own implementation of the BigInteger class, where I will provide my own methods for addition, subtraction, multiplication, etc.
This will be for arbitrarily long integer numbers, even hundreds of digits long.
While doing the math on these numbers digit by digit isn't hard, what do you think the best data structure would be to represent my "BigInteger"?
At first I was considering using an Array but then I was thinking I could still potentially overflow (run out of array slots) after a large add or multiplication. Would this be a good case to use a linked list, since I can tack on digits with O(1) time complexity?
Is there some other data-structure that would be even better suited than a linked list? Should the type that my data-structure holds be the smallest possible integer type I have available to me?
Also, should I be careful about how I store my "carry" variable? Should it, itself, be of my "BigInteger" type?

Check out the book C Interfaces and Implementations by David R. Hanson. It has 2 chapters on the subject, covering the vector structure, word size and many other issues you are likely to encounter.
It's written for C, but most of it is applicable to C++ and/or Java. And if you use C++ it will be a bit simpler because you can use something like std::vector to manage the array allocation for you.

Always use the smallest int type that will do the job you need (bytes). A linked list should work well, since you won't have to worry about overflowing.

If you use binary trees (whose leaves are ints), you get all the advantages of the linked list (unbounded number of digits, etc.) with simpler divide-and-conquer algorithms. In this case you do not have a single base but many, depending on the level at which you're working.
If you do this, you need to use a BigInteger for the carry. You may consider it an advantage of the "linked list of ints" approach that the carry can always be represented as an int (and this is true for any base, not just base 10, which most answers seem to assume you should use... in any base, the carry is always a single digit).
I might as well say it: it would be a terrible waste to use base 10 when you can use 2^30 or 2^31.

Accessing elements of linked lists is slow. I think arrays are the way to go, with lots of bound checking and run time array resizing as needed.
Clarification: Traversing a linked list and traversing an array are both O(n) operations. But traversing a linked list requires dereferencing a pointer at each step. Just because two algorithms have the same complexity doesn't mean they take the same time to run. The overhead of allocating and deallocating n nodes in a linked list will also be much heavier than the memory management of a single array of size n, even if the array has to be resized a few times.

Wow, there are some… interesting answers here. I'd recommend reading a book rather than trying to sort through all this contradictory advice.
That said, C/C++ are also ill-suited to this task. Big-integer arithmetic is a kind of extended-precision math. Most CPUs provide instructions to handle extended-precision math at the same or comparable speed (bits per instruction) as normal math. When you add, say, 2^31 + 2^31 in a 32-bit register, the result wraps to 0, but there is also a special carry output from the processor's ALU which a program can read and use.
C++ cannot access that flag, and there's no way in C either. You have to use assembler.
Just to satisfy curiosity, you can use the standard Boolean arithmetic to recover carry bits etc. But you will be much better off downloading an existing library.

I would say an array of ints.

An array is indeed a natural fit. I think it is acceptable to throw an OverflowException when you run out of space in memory. The teacher will see attention to detail.
A multiplication roughly doubles the number of digits, while addition increases it by at most 1. It is easy to create a sufficiently big array to store the result of your operation.
The carry is at most a one-digit number in multiplication (9*9 = 81: digit 1, carry 8). A single int will do.
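For illustration, here is a rough sketch of grade-school addition over base-10 digit arrays (least significant digit first); the method name is made up and base 10 is used only for readability. Note the carry never needs to be anything more than a plain int:

static int[] add(int[] a, int[] b) {
    int n = Math.max(a.length, b.length);
    int[] result = new int[n + 1];      // addition grows the length by at most 1
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int sum = carry
                + (i < a.length ? a[i] : 0)
                + (i < b.length ? b[i] : 0);
        result[i] = sum % 10;           // current digit
        carry = sum / 10;               // 0 or 1
    }
    result[n] = carry;                  // may leave a leading zero digit
    return result;
}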

std::vector<bool> or std::vector<unsigned int> is probably what you want. You will have to push_back() or resize() on them as you need more space for multiplies, etc. Also, remember to push_back the correct sign bits if you're using two's complement.

I would say a std::vector of char (since it only has to hold 0-9) if you plan to work in BCD.
If not BCD, then use a vector of int (you didn't make that clear).
Much less space overhead than a linked list.
And all the usual advice says 'use vector unless you have a good reason not to'.

As a rule of thumb, use std::vector instead of std::list, unless you need to insert elements in the middle of the sequence very often. Vectors tend to be faster, since they are stored contiguously and thus benefit from better spatial locality (a major performance factor on modern platforms).
Make sure you use elements that are natural for the platform. If you want to be platform independent, use long. Remember that unless you have some special compiler intrinsics available, you'll need a type at least twice as large to perform multiplication.
I don't understand why you'd want carry to be a big integer. Carry is a single bit for addition and element-sized for multiplication.
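As a sketch of what the "twice as large" type buys you (the method name is made up, and the ints are treated as unsigned 32-bit limbs stored least significant first): multiplying a limb by a 32-bit value produces a 64-bit product in a long, from which the low limb and the carry fall out directly.

static long multiplyBySingleLimb(int[] limbs, int multiplier) {
    long carry = 0;
    for (int i = 0; i < limbs.length; i++) {
        long product = (limbs[i] & 0xFFFFFFFFL) * (multiplier & 0xFFFFFFFFL) + carry;
        limbs[i] = (int) product;    // low 32 bits stay in this limb
        carry = product >>> 32;      // high 32 bits become the next carry
    }
    return carry;                    // non-zero if the number grew by a limb
}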
Make sure you read Knuth's Art of Computer Programming, algorithms pertaining to arbitrary precision arithmetic are described there to a great extent.

Related

Java - Large array advice on how to break it down [duplicate]

I'm trying to find a counterexample to the Pólya Conjecture, which will be somewhere in the 900 millions. I'm using a very efficient algorithm that doesn't even require any factorization (similar to a Sieve of Eratosthenes, but with even more information). So, a large array of ints is required.
The program is efficient and correct, but requires an array up to the x I want to check (it checks all numbers from 2 to x). So, if the counterexample is in the 900 millions, I need an array that is just as large. Java won't allow me anything over about 20 million. Is there anything I can possibly do to get an array that large?
You may want to extend the max size of the JVM Heap. You can do that with a command line option.
I believe it is -Xmx3600m (3600 megabytes)
Java arrays are indexed by int, so an array can't have more than 2^31 - 1 = 2147483647 elements (there are no unsigned ints). A plain int[] of that size would already consume about 8 GB.
Thus, the int-index is usually not a limitation, since you would run out of memory anyway.
In your algorithm, you should use a List (or a Map) as your data structure instead, and choose an implementation of List (or Map) that can grow beyond 2^31. This can get tricky, since the "usual" implementations ArrayList (and HashMap) use arrays internally. You will have to implement a custom data structure, e.g. by using a 2-level array (an array of arrays). While you are at it, you can also try to pack the bits more tightly.
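A hypothetical sketch of such a 2-level structure (class and constant names are made up): a long-indexed int store built from chunks of ordinary int[] arrays, so the total element count can exceed 2^31 - 1.

class BigIntArray {
    private static final int CHUNK_BITS = 26;            // 64M ints (256 MB) per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private final int[][] chunks;

    BigIntArray(long size) {
        // the last chunk is over-allocated for simplicity
        int chunkCount = (int) ((size + CHUNK_SIZE - 1) / CHUNK_SIZE);
        chunks = new int[chunkCount][CHUNK_SIZE];
    }

    int get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & (CHUNK_SIZE - 1))];
    }

    void set(long index, int value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & (CHUNK_SIZE - 1))] = value;
    }
}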
Java will allow up to about 2 billion array entries. It's your machine (and your limited memory) that cannot handle such a large amount.
900 million 32 bit ints with no further overhead - and there will always be more overhead - would require a little over 3.35 GiB. The only way to get that much memory is with a 64 bit JVM (on a machine with at least 8 GB of RAM) or use some disk backed cache.
If you don't need it all loaded in memory at once, you could segment it into files and store on disk.
What do you mean by "won't allow"? You are probably getting an OutOfMemoryError, so add more memory with the -Xmx command line option.
You could define your own class which stores the data in a 2d array which would be closer to sqrt(n) by sqrt(n). Then use an index function to determine the two indices of the array. This can be extended to more dimensions, as needed.
The main problem you will run into is running out of RAM. If you approach this limit, you'll need to rethink your algorithm or consider external storage (ie a file or database).
If your algorithm allows it:
Compute it in slices which fit into memory.
You will have to redo the computation for each slice, but it will often be fast enough.
Use an array of a smaller numeric type such as byte.
Depending on how you need to access the array, you might find a RandomAccessFile will allow you to use a file which is larger than will fit in memory. However, the performance you get is very dependent on your access behaviour.
I wrote a version of the Sieve of Eratosthenes for Project Euler which worked on chunks of the search space at a time. It processes the first 1M integers (for example), but keeps each prime number it finds in a table. After you've iterated over all the primes found so far, the array is re-initialised and the primes found already are used to mark the array before looking for the next one.
The table maps a prime to its 'offset' from the start of the array for the next processing iteration.
This is similar in concept (if not in implementation) to the way functional programming languages perform lazy evaluation of lists (although in larger steps). Allocating all the memory up-front isn't necessary, since you're only interested in the parts of the array that pass your test for primeness. Keeping the non-primes hanging around isn't useful to you.
This method also provides memoisation for later iterations over prime numbers. It's faster than scanning your sparse sieve data structure looking for the ones every time.
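For what it's worth, here is a rough, hypothetical sketch of that chunked idea (class and method names are made up, and it recomputes each prime's first multiple per segment rather than keeping the offset table described above):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SegmentedSieve {
    // Sieve [2, limit) one fixed-size segment at a time, keeping only the
    // primes found so far instead of one huge boolean array.
    public static List<Long> primesUpTo(long limit, int segmentSize) {
        List<Long> primes = new ArrayList<>();
        boolean[] composite = new boolean[segmentSize];
        for (long low = 2; low < limit; low += segmentSize) {
            long high = Math.min(low + segmentSize, limit);
            Arrays.fill(composite, false);
            // Mark multiples of the primes already known within this segment.
            for (long p : primes) {
                if (p * p >= high) break;
                long first = Math.max(p * p, ((low + p - 1) / p) * p);
                for (long m = first; m < high; m += p) {
                    composite[(int) (m - low)] = true;
                }
            }
            // Anything still unmarked is prime; mark its multiples in this segment.
            for (long n = low; n < high; n++) {
                if (!composite[(int) (n - low)]) {
                    primes.add(n);
                    for (long m = n * n; m < high; m += n) {
                        composite[(int) (m - low)] = true;
                    }
                }
            }
        }
        return primes;
    }

    public static void main(String[] args) {
        System.out.println(primesUpTo(100, 30)); // primes below 100
    }
}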
I second @sfossen's idea and @Aaron Digulla's. I'd go for disk access. If your algorithm can take in a List interface rather than a plain array, you could write an adapter from the List to the memory mapped file.
Use Tokyo Cabinet, Berkeley DB, or any other disk-based key-value store. They're faster than any conventional database but allow you to use the disk instead of memory.
Could you get by with 900 million bits? (Maybe stored as a byte array.)
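Something like this (helper names are made up): allocate new byte[(900000000 + 7) / 8], which is a little over 100 MB, and pack one flag per bit.

static void setBit(byte[] bits, int i) {
    bits[i >>> 3] |= 1 << (i & 7);              // byte i/8, bit i%8
}

static boolean getBit(byte[] bits, int i) {
    return (bits[i >>> 3] & (1 << (i & 7))) != 0;
}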
You can try splitting it up into multiple arrays.
List<Integer> myFirstList = new ArrayList<>();
List<Integer> mySecondList = new ArrayList<>();
for (int x = 0; x <= 1000000; x++) {
    myFirstList.add(x);
}
for (int x = 1000001; x <= 2000000; x++) {
    mySecondList.add(x);
}
Then iterate over them.
for (int x : myFirstList) {
    for (int y : myFirstList) {
        // Remove multiples
    }
}
// repeat for the second list
Use a memory-mapped file (the java.nio package, available since Java 1.4) instead. Or move the sieve into a small C library and use Java JNI.
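A minimal sketch of that idea, assuming one byte per number and an illustrative file name; a single mapping is limited to Integer.MAX_VALUE bytes, which is still enough for 900 million entries:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFlags {
    public static void main(String[] args) throws Exception {
        long size = 900_000_000L; // one byte per number to check
        try (RandomAccessFile file = new RandomAccessFile("flags.bin", "rw");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer flags =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            flags.put(123_456_789, (byte) 1);            // mark an entry
            System.out.println(flags.get(123_456_789));  // read it back
        }
    }
}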

Collections sort method vs iteration

I was working on a playing cards shuffle problem and found two solutions for it.
The target is to shuffle all 52 playing cards stored in an array as Card objects. The Card class has an id and a name associated with it.
Now, one way is to iterate using a for loop and then, with the help of a temp card object holder and a random number generator, swap two objects. This continues until we reach half of the cards.
Another way is to implement Comparable, overriding the compareTo method with a random number generator, so we get a random response each time the method is called.
Which way do you think is better?
You should not do it by sorting with a comparator that returns random results, because then those random results can be inconsistent with one another (e.g., saying that a<b<c<a), and this can actually result in the distribution of orderings you get being far from uniform. See e.g. this demonstration by Mike Bostock. Also, it takes longer, not that that should matter for shuffling 52 objects.
The standard way to do it does involve a loop, but your description sounds peculiar and I suspect what you have in mind may also not produce the ideal results. (If the question is updated to make it clearer what the "iterate using for loop" approach is meant to be, I will update this.)
(There is a way to get good shuffling by sorting: pair each element up with a random number -- e.g., a random floating-point number in the range 0..1 -- and then sort using that number as key. But this is slower than Fisher-Yates and requires extra memory. In lower-level languages it generally also takes more code; in higher-level languages it can be terser; I'd guess that for Java it ends up being about equal.)
[EDITED to add:] As Louis Wasserman very wisely says in comments, when your language's standard library has a ready-made function to do a thing, you should generally use it. Unless you're doing this for, e.g., a homework assignment that requires you to find and implement an algorithm to solve the problem.
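For reference, a minimal Fisher-Yates sketch (written generically so it works for the asker's Card[] or any other array); each position is swapped with a uniformly chosen position at or before it:

import java.util.Random;

static <T> void shuffle(T[] deck, Random rnd) {
    for (int i = deck.length - 1; i > 0; i--) {
        int j = rnd.nextInt(i + 1);   // uniform index in [0, i]
        T tmp = deck[i];              // swap deck[i] and deck[j]
        deck[i] = deck[j];
        deck[j] = tmp;
    }
}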
First of all, the comparator you've described won't work. More on this here. TL;DR: comparisons must be reproducible, so if your comparator says that a is less than b, then the next time it compares b to a it must return "greater", not a random value. The same goes for Comparable.
If I were you, I'd rather use the Collections#shuffle method, which "randomly permutes the specified list using a default source of randomness. All permutations occur with approximately equal likelihood". It's always better to rely on existing code than to write your own, especially if it is in the standard library.
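Usage is a one-liner (assuming the cards are already in a List; the variable names are illustrative):

List<Card> deck = new ArrayList<>(Arrays.asList(cards)); // cards is the existing Card[]
Collections.shuffle(deck);                               // uniform shuffle, done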

Is there a way to efficiently store a sequence of enum values in Java?

I'm looking for a way to encode a sequence of enum values in Java that packs better than one object reference per element. In fantasy-code:
List<MyEnum> list = new EnumList<MyEnum>(MyEnum.class);
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits per element. Is there an existing implementation for this, or a simple way to do it?
It would be sufficient to have a class that encodes a sequence of numbers of arbitrary radix (i.e. if there are 5 possible enum values then use base 5) into a sequence of bytes, since a simple wrapper class could be used to implement List<MyEnum>.
I would prefer a general, existing solution, but as a poor man's solution I might just use an array of longs and radix-encode as many elements as possible into each long. With 5 enum values, 27 elements will fit into a long and waste only ~1.3 bits, which is pretty good.
Note: I'm not looking for a set implementation. That wouldn't preserve the sequence.
You can store bits in an int (32 bits, 32 "switches"). But aside from the exercise value, what's the point? You're really talking about a very small amount of memory. A better question might be: why do you want to save a few bytes in enum references? Other parts of your program are likely to be using much more memory.
If you're concerned with transferring data efficiently, you could consider leaving the Enums alone but using custom serialization, though again, it'd be an unusual situation where it'd be worth the effort.
One object reference typically occupies one 32-bit or 64-bit word. To do better than that, you need to convert the enum values into numbers that are smaller than 32 bits, and hold them in an array.
Converting to a number is as simple as calling ordinal(). From there you could:
cast to a byte or short, then represent the sequence as an array of byte / short values, or
use a suitable compression algorithm on the array of int values.
Of course, all of this comes at the cost of making your code more complicated. For instance you cannot make use of the collection APIs, and you have to do your own sequence management. I doubt that this will be worth it unless you have to deal with very large sequences or huge numbers of sequences.
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits.
In fact you may be able to do better than that ... by compressing the sequences. It depends on how much redundancy there is.
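A rough sketch of the first option (the class name is made up): store each value's ordinal() in a byte[], which works for enums with up to 256 constants, and wrap it as a read-only List.

import java.util.AbstractList;

class EnumArrayList<E extends Enum<E>> extends AbstractList<E> {
    private final E[] constants;
    private final byte[] ordinals;

    EnumArrayList(Class<E> type, byte[] ordinals) {
        this.constants = type.getEnumConstants();
        this.ordinals = ordinals;
    }

    @Override
    public E get(int index) {
        return constants[ordinals[index] & 0xFF]; // mask to treat the byte as unsigned
    }

    @Override
    public int size() {
        return ordinals.length;
    }
}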

Performance implications of using Java BigInteger for a huge bitmask

We have an interesting challenge. We have to control access to data that reside in "bins". There will be, potentially, hundreds of thousands of "bins". Access to each bin is controlled individually but the restrictions can, and probably will, overlap. We are thinking of assigning each bin a position in a bitmask (1,2,3,4, etc..).
Then when a user logs into the system, we look at his security attributes and determine which bins he's allowed to see. With that info we construct a bitmask for this user where the "set" bits correspond to the identifier of the bins he's allowed to see. So if he can see bins 1, 3 and 4, his bit mask would be 1101.
So when a user searches the data, we can look at the bin index of the returned row and see if that bit is set on his bitmask. If his bitmask has that bit set we let him see that row. We are planning for the bitmask to be stored as a BigInteger in Java.
My question is: assuming the index number doesn't get bigger than Integer.MAX_VALUE, is a BigInteger bitmask going to scale to hundreds of thousands of bit positions? Would it take forever to run BigInteger.testBit(n) where n could be huge (e.g. 874,837)? Would it take forever to create such a BigInteger?
And secondly: If you have an alternative approach, I'd love to hear it.
BigInteger should be fast if you don't change it often.
A more obvious choice would be BitSet which is designed for this sort of thing. For looking up bits, I suspect the performance is similar. For creating/modifying it would be more efficient to use a BitSet.
Note: PaulG has commented the difference is "impressive" and BitSet is faster.
Java has a more convenient class for this, called BitSet.
You do not need to check if the bit is set in a loop: you can make a mask, use a bitwise and, and see if the result is non-empty to decide on whether to grant or deny the access:
BitSet resourceAccessMask = ...
BitSet userAllowedAccessMask = ...
BitSet test = (BitSet) resourceAccessMask.clone();
test.and(userAllowedAccessMask);
if (!test.isEmpty()) {
    System.out.println("access granted");
} else {
    System.out.println("access denied");
}
We used this class in a similar situation in my prior company, and the performance was acceptable for our purposes.
You could define your own Java interface for this, initially using a Java BitSet to implement that interface.
If you run into performance issues, or if you require the use of long indices later on, you can always provide a different implementation (e.g. one that uses caching or similar improvements) without changing the rest of the code. Think well about the interface you require, and choose a long index just to be sure; you can always check in the implementation whether it is out of bounds (or simply return "no access" initially) for any index > Integer.MAX_VALUE.
Using BigInteger is not such a good idea, as the class was not written for that particular purpose, and the only way of changing it is to create a fully new copy. It is efficient in terms of memory use; it stores its magnitude in an int array internally (at the moment; this could of course change).
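A sketch of the interface idea from above (the names are made up): code against your own interface, back it with a BitSet for now, and treat anything beyond the int range as "no access" until you need more.

import java.util.BitSet;

interface AccessMask {
    boolean canSee(long binIndex);
}

class BitSetAccessMask implements AccessMask {
    private final BitSet bits;

    BitSetAccessMask(BitSet bits) {
        this.bits = bits;
    }

    @Override
    public boolean canSee(long binIndex) {
        // BitSet is int-indexed; anything outside that range is denied for now
        return binIndex >= 0 && binIndex <= Integer.MAX_VALUE
                && bits.get((int) binIndex);
    }
}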
One thing that is worth considering (besides using a BitSet) is using a different granularity: a shorter bit set where each bit 'guards' multiple real bits. This way you would not need to keep millions of bits per user in RAM.
A simple way to achieve this is to keep a smaller guard bit set of size n/32 and do something like this:
boolean isSet(int n) {
    // guardingBits has one bit per 32-bit block of realBits
    return guardingBits.get(n / 32) && realBits.get(n);
}
This gives you a good chance of avoiding loading the real bits when those bits are mostly zero. You can adapt this approach to the bit distribution you expect: if you expect almost all bits to be set, you can instead have each guarding bit store a one when all the bits it guards are set, so you only need to check the blocks that might contain a zero.
And this might be just the beginning. Depending on the usage and requirements, you might want to use a B-tree or a paginated version where you only hold a fraction of the big bit field in memory.

Fastest sort for small collections

Many times I have to sort large numbers of small lists and arrays. It is quite rare that I need to sort large arrays. Which is the fastest sort algorithm for sorting:
arrays
(array)lists
of size 8-15 elements of these types:
integer
string from 10-40 characters
?
I am listing element types because some algorithms do more compare operations and fewer swap operations.
I am considering Merge Sort, Quick Sort, Insertion Sort and Shell sort (2^k - 1 increment).
Arrays.sort(..) / Collections.sort(..) will make that decision for you.
For example, the OpenJDK 7 implementation of Arrays.sort(..) has INSERTION_SORT_THRESHOLD = 47: it uses insertion sort for ranges with fewer than 47 elements.
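Just to illustrate what that fallback looks like (this is the textbook algorithm, not the JDK's exact code): insertion sort does no recursion and no allocation, which is why it wins on 8-15 elements.

static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];   // shift larger elements one slot to the right
            j--;
        }
        a[j + 1] = key;        // drop the key into the gap
    }
}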
Unless you can prove that this is a bottleneck then the built in sorts are fine:
Collections.sort(myIntList);
Arrays.sort(myArray);
Actually, there isn't a universal answer. Among other things, the performance of a Java sort algorithm will depend on the relative cost of the compare operation, and (for some algorithms) on the order of the input. In the case of a list, it also depends on the list implementation type.
But @Bozho's advice is sound, as is @Sean Patrick Floyd's comment.
FOLLOWUP
If you believe that the performance difference is going to be significant for your use-case, then you should get hold of some implementations of different algorithms, and test them out using the actual data that your application needs to deal with. (And if you don't have the data yet, it is too soon to start tuning your application, because the sort performance will depend on actual data.)
In short, you'll need to do the benchmarking yourself.
