Compress array of numbers

Compress array of numbers - java

I have a large array (~400.000.000 entries) with integers of {0, 1, ..., 8}.
So I need 4 bits per entry. Around 200 MB.
At the moment I use a byte-array and save 2 numbers in each entry.
I wonder if there is a good method to compress this array. I did some quick research and found algorithms like Huffman or LZW. But these algorithms are all meant for compressing the data, sending the compressed data to someone, and decompressing it on the other side.
I just want a table that takes less memory, so I can load it into RAM. The 200 MB table easily fits, but I'm thinking about even bigger tables.
It is important that I can still determine the values at given positions.
Any tips?
Further information:
I just did a little experimenting, and it turns out, that on average 2.14 consecutive numbers have the same value.
There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights.
So more than half of the numbers are 6s.
I only need a fast getValue(idx) function; setValue(idx, value) is not important.

It depends on what your data look like. Are there repeating entries, or do they change only slowly, or what?
If so, you can try to compress chunks of your data and decompress when needed. The bigger the chunks, the more memory you can save and the worse the speed. IMHO no good deal. You could also save the data compressed and decompress in memory.
Otherwise, i.e., if there are no regularities, you'll need at least log(9) / log(2) = 3.17 bits per entry and there's nothing that could improve on it.
You can come pretty close to this value by packing 5 numbers into a short. As 9**5 = 59049 < 65536 = 2**16, it fits nearly perfectly. You'll need 3.2 bits per number, no big win. Five numbers a, b, c, d, e are packed via this formula
a + 9 * (b + 9 * (c + 9 * (d + 9 * e)))
and unpacking is trivial via a precomputed table.
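A minimal sketch of the packing above, in Python for brevity. The names pack, get_value and UNPACK are made up for illustration, and array("H") stands in for a Java short[]; padding the last group with zeros is also just an assumption of this sketch.

```python
from array import array

BASE = 9    # values are 0..8
TUPLE = 5   # five values per 16-bit code, since 9**5 = 59049 < 65536

# Precomputed table: 16-bit code -> the five values (a, b, c, d, e) it encodes.
UNPACK = []
for code in range(BASE ** TUPLE):
    digits = []
    rest = code
    for _ in range(TUPLE):
        digits.append(rest % BASE)
        rest //= BASE
    UNPACK.append(tuple(digits))


def pack(values):
    """Pack a sequence of values in 0..8 into unsigned 16-bit codes."""
    packed = array("H")
    for start in range(0, len(values), TUPLE):
        group = list(values[start:start + TUPLE])
        group += [0] * (TUPLE - len(group))  # pad the last group (sketch assumption)
        a, b, c, d, e = group
        packed.append(a + 9 * (b + 9 * (c + 9 * (d + 9 * e))))
    return packed


def get_value(packed, idx):
    """Random access: one array read plus one table lookup."""
    return UNPACK[packed[idx // TUPLE]][idx % TUPLE]
```

The table has 59049 small tuples, so it costs next to nothing compared to the data itself.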
UPDATE after question update
Further information: I just did a little experimenting, and it turns out, that on average 2.14 consecutive numbers have the same value. There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights. So more than half of the numbers are 6s.
The fact that on average about 2.14 consecutive numbers are the same could lead to some compression, but in this case it tells us nothing. There are nearly only fives and sixes, so encountering two equal consecutive numbers is to be expected anyway.
Given these facts, you can forget my above optimization. There are practically only 8 values, as you can treat the single zero separately. So you need just 3 bits per value and a single index for the zero.
You can even create a HashMap for all values below four or above seven, store there 1+154+10373+385990+80 entries and use only 2 bits per value. But this is still far from ideal.
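Roughly, the 2-bit variant could look like this. It is only a sketch in Python: build, get_value, COMMON and DECODE are hypothetical names, a plain dict stands in for the HashMap, and the rare positions get a dummy 2-bit code that is never read.

```python
COMMON = {4: 0, 5: 1, 6: 2, 7: 3}   # frequent value -> 2-bit code
DECODE = [4, 5, 6, 7]               # 2-bit code -> frequent value


def build(values):
    """Return (packed 2-bit array, dict of rare positions -> value)."""
    packed = bytearray((len(values) + 3) // 4)   # four entries per byte
    rare = {}
    for idx, v in enumerate(values):
        if v in COMMON:
            code = COMMON[v]
        else:
            rare[idx] = v   # values 0, 1, 2, 3 and 8 go to the side table
            code = 0        # dummy code, shadowed by the side table
        packed[idx >> 2] |= code << ((idx & 3) * 2)
    return packed, rare


def get_value(packed, rare, idx):
    if idx in rare:         # side table is consulted first
        return rare[idx]
    return DECODE[(packed[idx >> 2] >> ((idx & 3) * 2)) & 3]
```

The price is one hash lookup per access, which is why a real implementation would probably want a primitive-keyed map.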
Assuming no regularities, you'd need 1.44 bit per value as this is the entropy. You could go over all 5-tuples, compute their histogram, and use 1 byte for encoding of the 255 most frequent tuples. All the remaining tuples would map to the 256th value, telling you that you have to look in a HashMap for the rare tuple value.
Some evaluation
I was curious if it can work. The packing of 5 numbers into one byte needs 85996340 bytes. There are nearly 5 million tuples which don't fit, so my idea was to use a hash map for them. Assuming rehashing rather than chaining it makes sense to keep it maybe 50% full, so we need 10 million entries. Assuming TIntShortHashMap (mapping indexes to tuples) each entry takes 6 bytes, leading to 60 MB. Too bad.
Packing only 4 numbers into one byte consumes 107495425 bytes and leaves 159531 tuples which don't fit. This looks better, however, I'm sure the denser packing could be improved a lot.
The results as produced by this little program:
*** Packing 5 numbers in a byte. ***
Normal packed size: 85996340.
Number of tuples in need of special handling: 4813535.
*** Packing 4 numbers in a byte. ***
Normal packed size: 107495425.
Number of tuples in need of special handling: 159531.

There are many options - most depend on how your data looks. You could use any of the following and even combinations of them.
LZW - or variants
In your case a variant that uses a 4-bit initial dictionary would probably be a good start.
You could compress your data in blocks so you could use the index requested to determine which block to decode on the fly.
This would be a good fit if there are repeating patterns in your data.
Difference Coding
Your edit suggests that your data may benefit from a differencing pass. Essentially you replace every value with the difference between it and its predecessor.
Again you would need to treat your data in blocks and difference fixed run lengths.
You may also find that differencing followed by LZW would be a good solution.
Fourier Transform
If some data loss would be acceptable then some of the Fourier Transform compression schemes may be effective.
Lossless JPEG
If your data has a 2-dimensional aspect then some of the JPEG algorithms may lend themselves well.
The bottom line
You need to bear in mind:
The longer time you spend compressing - up to a limit - the better compression ratio you can achieve
There is a real practical limit to how far you can go with lossless compression.
Once you go lossy you are essentially no longer restricted. You could approximate the whole of your data with new int[]{6} and get quite a few correct results.

As more than 1/2 of the entries are sixes, then just encode those as a single bit. Use 2 bits for the second most common and so on. Then you have something like this:
           #entries   bit pattern   #bits   total # of bits
zero              1   000000001         9                 9
ones            154   0000001           7              1078
twos          10373   000001            6             62238
threes       385990   00001             5           1929950
fours       8146188   0001              4          32584752
fives      85008968   01                2         170017936
sixes     265638366   1                 1         265638366
sevens     70791576   001               3         212374728
eights           80   00000001          8               640
--------------------------------------------------------
Total                                          682609697 bits
With 429981696 entries encoded with 682609697 bits, you would then need 1.59 bit per entry on average.
Edit:
To allow for fast lookup, you can make an index into the compressed data that shows where every n-th entry starts. Finding the exact value would then involve decompressing on average n/2 entries. Depending on how fast it should be, you can adjust the number of entries in the index. To reduce the size of the pointer into the compressed data (and thus the size of the index), use an estimate and just store the offset from that estimate.
                                Estimated pos   Offset from
# entry no   Actual position    (n * 1.59)      estimated
0            0                  0               0
100          162                159             3             Use this
200          332                318             14        <-- column as
300          471                477             -6            the index
400          642                636             6
500          807                795             12
600          943                954             -11
The overhead for such an index, with one index entry per 100 values and 10 bits for the offset, would be 0.1 bit extra per entry.
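To make the lookup concrete, here is a small sketch in Python. It uses the bit patterns from the encoding table earlier in this answer (a run of zeros ended by a 1). The names encode, get_value and step are made up, the bits live in a plain list for readability, and the index stores absolute bit positions rather than the offset-from-estimate trick.

```python
# Number of leading zeros before the terminating 1, per value:
# 6 -> "1", 5 -> "01", 7 -> "001", 4 -> "0001", and so on.
ZEROS_FOR = {6: 0, 5: 1, 7: 2, 4: 3, 3: 4, 2: 5, 1: 6, 8: 7, 0: 8}
VALUE_FOR = {zeros: value for value, zeros in ZEROS_FOR.items()}


def encode(values, step):
    """Return (bits, index); index[k] is the bit position of entry k * step."""
    bits = []    # one list element per bit, a real version would pack them
    index = []
    for i, v in enumerate(values):
        if i % step == 0:
            index.append(len(bits))
        bits.extend([0] * ZEROS_FOR[v])
        bits.append(1)
    return bits, index


def get_value(bits, index, step, idx):
    pos = index[idx // step]
    for _ in range(idx % step):   # skip the codes in front of idx
        while bits[pos] == 0:
            pos += 1
        pos += 1                  # step over the terminating 1
    zeros = 0
    while bits[pos] == 0:         # decode the code at idx
        zeros += 1
        pos += 1
    return VALUE_FOR[zeros]
```

A smaller step means faster lookups and a bigger index, which is the trade-off described above.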

There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours,
85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights
Total = 429981696 symbols
Assuming the distribution is random, the entropy theorem says you cannot do better than 618297161.7 bits ~ 73.707 MB or on average 1.438 bits / symbol.
Minimum number of bits is SUM(count[i] * LOG(429981696 / count[i], 2)).
You can achieve this size using a range coder.
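The bound can be re-derived from the counts in a few lines. This is just a check of the formula above in Python; counts, total and bits are ad-hoc names.

```python
from math import log2

# Occurrences of the values 0..8 as given in the question.
counts = [1, 154, 10373, 385990, 8146188, 85008968, 265638366, 70791576, 80]

total = sum(counts)
# SUM(count[i] * LOG(total / count[i], 2)): zeroth-order entropy in bits.
bits = sum(c * log2(total / c) for c in counts)

print("symbols:        ", total)
print("minimum bits:   ", bits)
print("bits per symbol:", bits / total)
print("size in MiB:    ", bits / 8 / 2**20)
```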
Given Sqrt(N) = 20736
Again you can achieve O(Sqrt(N)) complexity for accessing a random element by saving an int[k = 0 .. CEIL(SQRT(N)) - 1] state with the arithmetic decoder state after each SQRT(N) decoded symbols. This allows fast decoding of the next 20736 block of symbols.
The complexity of accessing an element drops to O(1) if you access the memory stream in a linear way.
Additional memory used: 20736 * 4 = 81KB.

How about considering some caching solution, like mapdb, or apache jcs. This will enable you to persist the Collection to disk, thus enabling you to work with very large lists.

You should look into a BitSet to store it most efficiently. Contrary to what the name suggests, it is not exactly a set, it has order and you can access it per index.
Internally it uses an array of longs to store the bits and hence can update itself by using bit masks.
I don't believe you can store it any more efficiently natively, if you want even more efficiency, then you should consider packing/compression algorithms.

Related

Searching a file for unknown integer with minimum memory requirement [duplicate]

I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×10^9×4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4*10^9 distinct integers.
Case 1: we have 1 GB = 1 * 10^9 * 8 bits = 8 billion bits memory.
Solution:
If we use one bit representing one distinct integer, it is enough. we don't need sort.
Implementation:
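import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

// One bit per possible 32-bit value: 2^32 bits = 2^29 bytes = 512 MB.
// (0xffffffff is -1 as a Java int, so 0xffffffff/radix would give a zero-length array.)
byte[] bitfield = new byte[1 << 29];

void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        int n = in.nextInt();
        // Unsigned shift so that negative values index the upper half of the array.
        bitfield[n >>> 3] |= (1 << (n & 7));
    }
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < 8; j++) {
            if ((bitfield[i] & (1 << j)) == 0) {
                System.out.println((i << 3) | j); // first value not in the file
                return;
            }
        }
    }
}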
Case 2: 10 MB memory = 10 * 10^6 * 8 bits = 80 million bits
Solution:
There are 2^16 = 65536 possible 16-bit prefixes, so we need to build 65536 buckets. Each bucket needs a 4-byte counter, because in the worst case all 4 billion integers belong to the same bucket. That is 2^16 * 4 * 8 = 2 million bits.
1. Build the counter of each bucket in a first pass through the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. Build new buckets for the integers whose high 16 bits match the prefix found in step 2, in a second pass through the file.
4. Scan the buckets built in step 3 and find the first bucket that doesn't have a hit.
The code is very similar to above one.
Conclusion:
We decrease memory through increasing file pass.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.

Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes, in one pass through the input file. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are already used.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the length of the longest number you've seen. When you're done, output a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it.)

Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
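A sketch of the candidate-elimination idea for the bounded case, in Python. The function name find_absent and its parameters are invented for illustration, and the numbers argument is any iterable standing in for one pass over the file.

```python
import random


def find_absent(numbers, bits=32, candidates=1000, seed=None):
    """Return a value that does not occur in numbers, or None if every candidate was hit."""
    rng = random.Random(seed)
    # Pseudo-random candidates; duplicates collapse, so the pool may be slightly smaller.
    pool = {rng.getrandbits(bits) for _ in range(candidates)}
    for n in numbers:          # single pass over the input
        pool.discard(n)        # a match eliminates that candidate
    return next(iter(pool)) if pool else None
```

Memory use is bounded by the number of candidates, not by the size of the file. If the pool ends up empty, just draw a fresh set and scan again.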

This problem is discussed in detail in Jon Bentley, "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort, Merge Sort using several external files etc., But the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple: (taken from solutions page )
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) was available and the range was from 0 to 31, we divide this into ranges of 0 to 7, 8-15, 16-23 and so on, and handle one range in each of 32/8 = 4 passes.
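The k-pass idea could be sketched like this in Python. This is not Bentley's code; find_missing, read_numbers, universe and window are hypothetical names, and read_numbers() is assumed to return a fresh iterator over the input each time it is called (one file pass).

```python
def find_missing(read_numbers, universe, window):
    """Look for a value in 0..universe-1 that is absent, using only `window` bits at a time."""
    for low in range(0, universe, window):
        high = min(low + window, universe)
        seen = bytearray((high - low + 7) // 8)   # bitmap for the current range only
        for n in read_numbers():                  # one pass per range
            if low <= n < high:
                off = n - low
                seen[off >> 3] |= 1 << (off & 7)
        for off in range(high - low):
            if not seen[off >> 3] & (1 << (off & 7)):
                return low + off                  # first clear bit in this range
    return None                                   # every value is present
```

With universe=32 and window=8 this does at most the 4 passes from the example above, and it stops early as soon as one range has a gap.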
HTH.

Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)

For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you are done, iterate over the bits and find the first one that is still '0'. Its index is the answer.

If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4 * 10^9 / 2^32). So you can create a bit-array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1, and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints and only record numbers that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, with overwhelming probability (it differs from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had something like 2^32 - 1 numbers and less than 10 MB of RAM; so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract that sum (with one number missing) from the full sum (½)(2^32)(2^32 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2^31 - 1) and second half (2^31 to 2^32 - 1). Then pick the range with fewer numbers and repeat, dividing that range in half. (Say there were two fewer numbers in (2^31 to 2^32 - 1); then your next search would count the numbers in the ranges (2^31 to 3*2^30 - 1) and (3*2^30 to 2^32 - 1).) Keep repeating until you find a range with zero numbers and you have your answer. This should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4 byte (32-bit) integer). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, each with 65536 numbers in a bin. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535 = 2^16 - 1), bin 1 is (2^16 = 65536 to 2*2^16 - 1 = 131071), bin 2 is (2*2^16 = 131072 to 3*2^16 - 1 = 196607). In python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break  # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list; and count how many ints fall in each of the 2^16 bins and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again; but this time only notice when integers are in that range; flipping a bit when you find them.
del nums_in_bin  # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536)  # 65536 bits = 8 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print(bin_num * 65536 + i)
        break

Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.

This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
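In Python the loop might look roughly like this. One deliberate deviation, flagged as an assumption: instead of picking the half with the fewest numbers, the sketch picks a half whose count is below its capacity, which stays safe with duplicates and with halves of unequal size. find_missing and read_numbers are made-up names; read_numbers() returns a fresh iterator for each scan of the file.

```python
def find_missing(read_numbers, lo=0, hi=2**32 - 1):
    """Return a value in lo..hi that is not in the input, or None if the range is full."""
    while lo <= hi:
        mid = (lo + hi) // 2
        less = equal = more = 0
        for n in read_numbers():          # one linear scan per iteration
            if n < lo or n > hi:
                continue                  # outside the current range
            if n < mid:
                less += 1
            elif n > mid:
                more += 1
            else:
                equal += 1
        if equal == 0:
            return mid                    # the midpoint itself is missing
        if less < mid - lo:               # lower half cannot be full
            hi = mid - 1
        elif more < hi - mid:             # upper half cannot be full
            lo = mid + 1
        else:
            return None                   # both halves are completely covered
    return None
```

Only a handful of integers are kept in memory, at the cost of one scan per halving step.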
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.

EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer to not occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes, complexity: O(n), overhead: negligible, as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 7%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.

If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)

They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)

Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.

Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 bytes = approx 0.5 GB.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.

If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.

If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since it is set an even number of times, and exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that when exclusive-ored with the missing number will result in 0. Therefore, the missing number, and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let's call the number of times each bit is set in n, x and the number of times each bit is set in n+1 y. The value of y will be equal to y = x * 2 because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
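The claim is easy to check by brute force for small n. A quick sketch in Python, with xor_all and find_single_missing as ad-hoc helper names:

```python
from functools import reduce
from operator import xor


def xor_all(values):
    """XOR of all values, 0 for an empty input."""
    return reduce(xor, values, 0)


def find_single_missing(values):
    """Valid only if values are 0..2^n-1 (n >= 2) with exactly one number left out."""
    return xor_all(values)


# Every full range 0..2^n-1 with n >= 2 XORs to 0 ...
for n in range(2, 12):
    assert xor_all(range(2 ** n)) == 0

# ... so leaving one number out leaves exactly that number.
print(find_single_missing([x for x in range(16) if x != 11]))
```

For n = 1 the full range 0, 1 XORs to 1 rather than 0, which is why the induction starts at n = 2.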
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32, 2 32 bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
result = result ^ supplied;
}
return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
result = result ^ (supplied - offset);
}
return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
n++;
result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing
// int value
n++;
while (n == 1 || 0 != (n & (n - 1))) {
result = result ^ (n++);
}
return result + offset;
Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
$offset = min($array);
$n = 0;
$result = 0;
foreach ($array as $value) {
$result = $result ^ ($value - $offset);
$n++;
}
$n++; // This takes care of the missing value
while ($n == 1 || 0 != ($n & ($n - 1))) {
$result = $result ^ ($n++);
}
return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
if (supplied != last + 1) {
return last + 1;
}
last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;

Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.

Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
ulong bitcount = filesize * 8; //number of bits in file
for (ulong i = 0; i < bitcount; i++)
{
Console.Write(9);
}
}
Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^(bitcount - (4 * 10^9 - 1)), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.

The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.

Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.

For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.
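Here is the same trick in a few lines of Python. To keep it short, this sketch flips bit i counted from the low-order end of the i-th number instead of working on left-padded high-order bits, and it assumes non-negative integers. diagonal_number is an invented name.

```python
def diagonal_number(numbers):
    """Build a number that differs from the i-th input in bit i (low-order numbering)."""
    result = 0
    for i, n in enumerate(numbers):
        if not (n >> i) & 1:    # bit i of the i-th number is 0 ...
            result |= 1 << i    # ... so the output gets a 1 there
        # otherwise the output keeps a 0 at bit i, again disagreeing with n
    return result
```

The result has at most as many bits as there are numbers in the input, so for the 4 billion integer file it is a 4-billion bit number, as described above.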

You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.

Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false else (by comparing that certain integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
if (isNotInFile(i)) {
return i;
}
}

For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leaves will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion numbers is about 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible: if two leaves have been created and share a common parent, they can be removed and the parent flagged as not a solution. This cuts branches and reduces the memory needed.
EDIT II
There is no need to build the tree completely either. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might in fact work.

I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
foreach (var number in file) {
bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}
for (long i = 0; i < 4294967296; i++) { // long, because the bound exceeds the 32-bit int range
if (bitArray[i] == 0) {
return i - 2147483648;
}
}

As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
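The scan over the sorted data could look roughly like this in Java. This is only a sketch: it assumes signed 32-bit values, an ascending order, and an in-memory int[] standing in for the sorted file. The names SortedScan and firstMissing are hypothetical.

```java
final class SortedScan {
    /** Returns the smallest int absent from an ascending-sorted array (duplicates allowed). */
    static int firstMissing(int[] sorted) {
        long expected = Integer.MIN_VALUE;   // long, so that expected + 1 cannot overflow
        for (int v : sorted) {
            if (v > expected) {
                break;                       // gap: expected is not in the data
            }
            if (v == expected) {
                expected++;
            }
            // v < expected means v is a duplicate of an earlier value; skip it.
        }
        if (expected > Integer.MAX_VALUE) {
            throw new IllegalStateException("every 32-bit value is present");
        }
        return (int) expected;
    }
}
```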

Bit Elimination
One way is to eliminate bits; however, this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
val = val & ~fileVal;
if (val == 0) error;
}
Bit Counts
Keep track of the bit counts, and use the least frequent bits to generate a value. Again, this has no guarantee of generating a correct value.
Range Logic
Keep track of an ordered list of ranges (ordered by start). A range is defined by the structure:
struct Range
{
long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
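Here is one way the range bookkeeping might be sketched in Java. The assumptions are mine: values are non-negative signed longs (Java has no unsigned 64-bit literal like 0xFFFFFFFFFFFFFFFF), and a TreeMap keyed by range start replaces the ordered list. FreeRanges, remove and anyFree are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

final class FreeRanges {
    // Maps each free range's inclusive start to its inclusive end.
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    FreeRanges(long start, long end) {
        ranges.put(start, end);
    }

    /** Removes value from the range that contains it, splitting the range if needed. */
    void remove(long value) {
        Map.Entry<Long, Long> e = ranges.floorEntry(value);
        if (e == null || e.getValue() < value) {
            return;                                   // value was already removed
        }
        long start = e.getKey();
        long end = e.getValue();
        ranges.remove(start);
        if (start < value) ranges.put(start, value - 1);   // left remainder
        if (value < end) ranges.put(value + 1, end);       // right remainder
    }

    /** Returns some value that was never removed. */
    long anyFree() {
        if (ranges.isEmpty()) throw new IllegalStateException("no free value left");
        return ranges.firstKey();
    }
}
```

Each removal can split one range into two, so in the worst case the map grows with the number of distinct values seen, which is why there is no memory guarantee.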

2^(128*10^18) + 1 (which is (2^8)^(16*10^18) + 1): couldn't it be a universal answer for today? This represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.

I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
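A small Java sketch of the running-total trick. To make it testable I generalized the word size to a bits parameter (my assumption, not part of the answer); bits = 32 corresponds to the case described, where the input would be streamed from the file rather than held in an array. It relies on the sum of all 2^bits values being congruent to 2^(bits-1) modulo 2^bits, which gives the same result as the add-then-subtract steps above. SumMissing and missing are hypothetical names.

```java
final class SumMissing {
    /**
     * values must contain every integer in [0, 2^bits) exactly once, except one.
     * Returns the absent integer. Assumes 1 <= bits <= 32.
     */
    static long missing(int[] values, int bits) {
        long mask = (1L << bits) - 1;            // keeps all arithmetic modulo 2^bits
        long runningTotal = 0;
        for (int v : values) {
            runningTotal = (runningTotal + (v & mask)) & mask;
        }
        long fullSum = 1L << (bits - 1);         // sum of 0 .. 2^bits - 1, modulo 2^bits
        return (fullSum - runningTotal) & mask;
    }
}
```

With 3-bit numbers and 6 missing, the running total is 22 mod 8 = 6, and (4 - 6) mod 8 = 6.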
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print four '2' digits for every byte of the file.
for (c=0; c<sz; c++) {
for (b=0; b<4; b++) {
print "2";
}
}

As Ryan basically said: sort the file, then go over the integers; when a value is skipped, there you have it :)
EDIT for downvoters: the OP mentioned that the file could be sorted, so this is a valid method.

If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) = 4611686018.4 (roughly 1 in 4.6 billion). You'd be right most of the time!
(Joking... kind of.)

Data structure recommendation

Developing in Java, I need a data structure to select N distinct random numbers between 0 and 999999.
I want to be able to quickly allocate N numbers and make sure they don't repeat themselves.
The main goal is not to use too much memory and still keep performance reasonable.
I am considering using a BitSet, but I am not sure about the memory implications.
Can someone tell me if the memory requirements of this class are related to the number of bits or to the number of set bits? And what is the complexity of setting/testing a bit?
UPDATE:
Thanks for all the replies so far.
I think I had this in my initial wording of this question but removed it when I first saw the BitSet class.
Anyway I wanted to add the following info:
Currently I am looking at N of a few thousands at most (most likely around 1000-2000) and a number range of 0 to 999999.
But I would like my choice to take into consideration the option of increasing the range to 8 digits (i.e. 0 to 99 999 999) while keeping N at roughly the same ranges (maybe increase it to 5K or 10K).
So the "used values" are quite sparse.

It depends on how large N is.
For small values of N, you could use a HashSet<Integer> to hold the numbers you have already issued. This gives you O(1) lookup and O(N) space usage.
A BitSet for the range 0-999999 is going to use roughly 125 KB, regardless of the value of N. For large enough values of N, this will be more space efficient than a HashSet. I'm not sure exactly what the value of N is where a BitSet will use less space, but my guesstimate would be 10,000 to 20,000.
Can someone tell me if the memory requirements of BitSet are related to the number of bits or to the number of set bits?
The size is determined either by the largest bit that has ever been set, or the nBits parameter if you use the BitSet(int nBits) constructor.
and what is the complexity to setting/testing a bit ?
Testing bit B is O(1).
Setting bit B is O(1) in the best case, and O(B) if the BitSet's backing array needs to be expanded. However, since the backing array at least doubles in size each time it grows (in the OpenJDK implementation), the cost of expansion can typically be amortized over multiple BitSet operations.

A BitSet will take up as much space as 1,000,000 bits, which is 125,000 bytes or roughly 122 KB, plus some minor overhead and space to grow. An array of the actual numbers, i.e. an int[], will take N × 4 bytes of space plus some overhead. The break-even point is
4 × N = 125,000
N = 31250
I'm not intimately familiar with Java internals, but I suspect it won't allocate more than twice the actual space used, so you're using less than 250 KB of memory with a bitset. Also, an array makes it harder to find the duplicates when you need unique integers, so I'd use the bitset either way and perhaps convert it to an array at the end, if that's more convenient for further processing.
Setting/getting a bit in a BitSet will have constant complexity, although it takes a few more operations than getting one out of a boolean[].
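For illustration, a minimal Java sketch of drawing N distinct random numbers with a BitSet as the "already used" marker. It assumes N is much smaller than the range (as in the question), so the retry loop rarely repeats. DistinctRandom and pick are hypothetical names.

```java
import java.util.BitSet;
import java.util.Random;

final class DistinctRandom {
    /** Picks n distinct values in [0, range); requires n <= range. */
    static int[] pick(int n, int range, Random rnd) {
        if (n > range) {
            throw new IllegalArgumentException("n exceeds range");
        }
        BitSet used = new BitSet(range);       // sized up front, about range / 8 bytes
        int[] result = new int[n];
        int count = 0;
        while (count < n) {
            int candidate = rnd.nextInt(range);
            if (!used.get(candidate)) {        // constant-time test
                used.set(candidate);           // constant-time set, no growth needed
                result[count++] = candidate;
            }
        }
        return result;
    }
}
```

A call such as DistinctRandom.pick(2000, 1_000_000, new Random()) keeps the BitSet at roughly 125 KB regardless of how many values are drawn. If N approached the range, the retry loop would slow down and a shuffle-based approach would be a better fit.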

jvm heap setting pattern

I have observed that when setting the heap size, people prefer the values 64, 128, 256, 1024, and so on. If I give a value in between these numbers (say 500), won't the JVM accept it? Why are these numbers important and preferred? Why do we also upgrade RAM in this pattern?
Please help me understand.

The JVM will accept any value, no problem with that. Using 2^n values is just a convention; using other values will have no negative effect in practice.

Well, if you think about it this way:
1 byte is 8 bits
1 KB = 1024 bytes
1 MB = 1024 KB
1 GB = 1024 MB
... and so on ...
It's not just 2^n. Things in terms of memory in computing are closely related to the number eight - the number which defines one byte in most modern computers.
The main reason why bits are grouped together is to represent characters. Because of the binary nature of all things computing, ideal 'clumps' of bits come in powers of 2 i.e. 1, 2, 4, 8, 16, 32.... (basically because they can always be divided into smaller equal packages (it also creates shortcuts for storing size, but that's another story)). Obviously 4 bits (nybble in some circles) can give us 2^4 or 16 unique characters. As most alphabets are larger than this, 2^8 (or 256 characters) is a more suitable choice.
Machines exist that have used other length bytes (particularly 7 or 9). This has not really survived mainly because they are not as easy to manipulate. You certainly cannot split an odd number in half, which means if you were to divide bytes, you would have to keep track of the length of the bitstring.
Finally, 8 is also a convenient number, many people (psychologists and the like) claim that the human mind can generally recall only 7-8 things immediately (without playing memory tricks).

If it won't accept the value, check whether you put a megabytes (M or m) or gigabytes (G or g) modifier after the amount.
Example: java -Xms500M -Xmx500M -jar myJavaProgram.jar
Also, take a look at this link.
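If you want to see what the JVM actually ended up with, a tiny Java sketch like the following prints the maximum heap via Runtime.maxMemory(). The class name HeapInfo is hypothetical, and the reported figure is not guaranteed to match the -Xmx value exactly.

```java
final class HeapInfo {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it as java -Xmx500m HeapInfo should start normally and print a value at or near 500, which shows that a non-power-of-two setting is accepted.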

Why we also upgrade RAM in this pattern.
That is because memory chips and cards come in sizes that are a power of 2 bytes. And the fundamental reason for that is that it makes the electronics simpler. And simpler means cheaper, more reliable and (probably) faster.

Apart from the unwritten convention, it also has a performance impact, depending on the architecture of the machine.
For example if a machine is ternary based, it would work better with a heap size set to a value which is a power of 3.

N-way merge sort a 2G file of strings

This is another question from Cracking the Coding Interview; I still have some doubts after reading it.
9.4 If you have a 2 GB file with one string per line, which sorting algorithm
would you use to sort the file and why?
SOLUTION
When an interviewer gives a size limit of 2GB, it should tell you something - in this case, it suggests that they don’t want you to bring all the data into memory.
So what do we do? We only bring part of the data into memory..
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.
Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the lines back to the file.
Now bring the next chunk into memory and sort.
Once we’re done, merge them one by one.
The above algorithm is also known as external sort. Step 3 is known as N-way merge.
The rationale behind using external sort is the size of data. Since the data is too huge and we can’t bring it all into memory, we need to go for a disk based sorting algorithm.
Doubt:
In step 3, when doing the merge and comparing 2 arrays, do we need 2*X space each time we compare? The limit was X MB. Should we make the chunks so that (X/2) * 2K = 2 GB, so that each chunk will be X/2 MB and there will be 2K chunks? Or am I just misunderstanding the merge sort?
Thanks!

http://en.wikipedia.org/wiki/External_sorting
A quick look on Wikipedia tells me that during the merging process you never hold a whole chunk in memory. So basically, if you have K chunks, you will have K open file pointers but you will only hold one line from each file in memory at any given time. You will compare the lines you have in memory and then output the smallest one (say, from chunk 5) to your sorted file (also an open file pointer, not in memory), then overwrite that line with the next line from that file (in our example, file 5) into memory and repeat until you reach the end of all the chunks.
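To make that concrete, here is a minimal Java sketch of such a K-way merge. It is not taken from the Wikipedia article: iterators stand in for the K open chunk files, a list stands in for the output file, and a PriorityQueue picks the smallest pending line. KWayMerge, Head and merge are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMerge {
    /** One pending line together with the chunk it was read from. */
    private static final class Head {
        final String line;
        final Iterator<String> source;

        Head(String line, Iterator<String> source) {
            this.line = line;
            this.source = source;
        }
    }

    /** Merges already-sorted chunks while holding only one line per chunk. */
    static List<String> merge(List<Iterator<String>> sortedChunks) {
        PriorityQueue<Head> heads =
                new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        for (Iterator<String> chunk : sortedChunks) {
            if (chunk.hasNext()) {
                heads.add(new Head(chunk.next(), chunk));
            }
        }
        List<String> out = new ArrayList<>();      // stands in for the output file
        while (!heads.isEmpty()) {
            Head smallest = heads.poll();          // smallest line currently in memory
            out.add(smallest.line);
            if (smallest.source.hasNext()) {       // refill from the same chunk
                heads.add(new Head(smallest.source.next(), smallest.source));
            }
        }
        return out;
    }
}
```

With real files, each iterator would wrap a BufferedReader and each output line would be written straight to disk, so memory use stays at about one line per chunk plus I/O buffers.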

First off, step 3 itself is not a merge sort, the whole thing is a merge sort. Step 3 is just a merge, with no sorting involved at all.
And as to the storage required, there are two possibilities.
The first is to merge the sorted data in groups of two. Say you have three groups:
A: 1 3 5 7 9
B: 0 2 4 6 8
C: 2 3 5 7
With that method, you would merge A and B in to a single group Y then merge Y and C into the final result Z:
Y: 0 1 2 3 4 5 6 7 8 9 (from merging A and B).
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging Y and C).
This has the advantage of a very small constant memory requirement in that you only ever need to store the "next" element from each of two lists but, of course, you need to do multiple merge operations.
The second way is a "proper" N-way merge where you select the next element from any of the groups. With that you would check the lowest value in every list to see which one comes next:
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging A, B and C).
This involves only one merge operation but it requires more storage, basically one element per list.
Which of these you choose depends on the available memory and the element size.
For example, if you have 100M memory available to you and the element size is 100K, you can use the latter. That's because, for a 2G file, you need 20 groups (of 100M each) for the sort phase which means a proper N-way merge will need 100K by 20, or about 2M, well under your memory availability.
Alternatively, let's say you only have 1M available. That will be about 2000 (2G / 1M) groups and multiplying that by 100K gives 200M, well beyond your capacity.
So you would have to do that merge in multiple passes. Keep in mind though that it doesn't have to be multiple passes merging two lists.
You could find a middle ground where for example each pass merges ten lists. Ten groups of 100K is only a meg so will fit into your memory constraint and that will result in fewer merge passes.
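The arithmetic in these examples can be written down as a few lines of Java. This is just a back-of-the-envelope sketch with decimal units (1M = 1,000,000 bytes, which is my assumption); it ignores I/O buffers and other overhead. MergePlan and its methods are hypothetical names.

```java
final class MergePlan {
    /** Number of sorted groups when each group fills the available memory. */
    static long groups(long fileBytes, long memoryBytes) {
        return (fileBytes + memoryBytes - 1) / memoryBytes;   // ceiling division
    }

    /** Bytes needed to hold one element per group during a single N-way merge. */
    static long mergeBytes(long fileBytes, long memoryBytes, long elementBytes) {
        return groups(fileBytes, memoryBytes) * elementBytes;
    }

    /** True if a single N-way merge over all groups fits in memory. */
    static boolean fitsInOnePass(long fileBytes, long memoryBytes, long elementBytes) {
        return mergeBytes(fileBytes, memoryBytes, elementBytes) <= memoryBytes;
    }
}
```

For a 2G file with 100K elements, 100M of memory gives 20 groups and a 2M merge buffer (one pass is enough), while 1M of memory gives 2000 groups and 200M (multiple passes are needed).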

The merging process is much simpler than that. You'll be outputting them to a new file, but basically you only need constant memory: you only need to read one element from each of the two input files at a time.

Find missing number in unsorted input with memory constraints

I was reading about the problem of finding the missing number from a series of 4 billion 32 bit integers in Programming Pearls, but could not quite get the solution.
Given a sequential file that contains at most 4 Billion 32-bit
integers in random order, find a 32-bit integer not on the file.
Constraint: Few hundreds of bytes of main memory but ample use of
external scratch files on disk
The solution described is a process where we separate the numbers into ranges using the upper bits (i.e. in the first pass we write those with a leading 0 bit to one file and those with a leading 1 bit to another; we keep going using the second bit, etc.) and divide in half, using as the new search range the half containing fewer than half of the numbers in the range.
I googled and found a similar post on SO which does not quite answer my question, which is how the exact number is found (I understand how the binary search fits in to separate the ranges).
The answer of @Damien_The_Unbeliever seems the most relevant, but from my point of view I thought that following the process we would end up with 2 files: a file with 2 numbers in range and a file with 1 number.
By subtracting the (one) number in one file from the largest of the others we can get a missing number (no need for a bitmask, and I am not quite sure if we could actually apply a bitmask since the range is unknown at any point).
Am I wrong on this? Could someone help figure this out?

You needn't copy the data itself or write anything to disk; just count the members of some partition of the data to identify openings. The tradeoff is between number of passes and memory (more memory allows for more counts, smaller partitions).
Here's a solution in 8 passes. We'll partition the data using 4 bits at a time. 2^4 = 16 possible values, so we'll need 64 bytes to store counts for each of the 16 partitions. So each 4 bit nibble value has an associated count.
Make a pass through the data, incrementing the associated count matching the nibble in the first four bits of the number. If a partition is full, its count will be 2^28. Pick one of the nibbles that isn't full --- this will be the first nibble of your result.
Zero your counts and make another pass, ignoring numbers that don't match the first nibble and incrementing the count associated with the second nibble in the number. A full second nibble will have a value of 2^24. Pick one that isn't full.
Proceed in this manner until you have all 8 nibbles, and there's your answer in O(N).
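A rough Java sketch of the 8-pass nibble counting. My assumptions: the data sits in an int[] that is re-scanned on every pass (standing in for re-reading the file), and the 16 counters are longs (128 bytes instead of 64) so that duplicates cannot overflow them. A nibble counts as "not full" when its count is below the number of distinct values it could hold. NibbleCounting and findMissing are hypothetical names.

```java
final class NibbleCounting {
    /**
     * Returns a 32-bit value that does not occur in data.
     * Works because data has fewer than 2^32 entries (always true for a Java array).
     */
    static int findMissing(int[] data) {
        int prefix = 0;                              // nibbles chosen so far
        for (int pass = 0; pass < 8; pass++) {
            int shift = 28 - 4 * pass;               // bit position of this pass's nibble
            long[] counts = new long[16];            // one counter per nibble value
            for (int v : data) {                     // one sequential pass over the input
                // Only count values that match every nibble chosen so far.
                boolean matches = pass == 0 || (v >>> (shift + 4)) == prefix;
                if (matches) {
                    counts[(v >>> shift) & 0xF]++;
                }
            }
            long full = 1L << shift;                 // 2^28, 2^24, ..., 2^0
            int pick = -1;
            for (int n = 0; n < 16 && pick < 0; n++) {
                if (counts[n] < full) {
                    pick = n;                        // this nibble has room for a missing value
                }
            }
            prefix = (prefix << 4) | pick;           // a pick always exists by pigeonhole
        }
        return prefix;
    }
}
```

For example, findMissing(new int[]{0, 1, 2}) picks nibble 0 in the first seven passes and nibble 3 in the last one, returning 3.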
If you only check one bit at a time, this would be a binary search requiring 32 passes. (EDIT: Not really a binary search, since you still have to read the values that you're skipping. That's why it's O(N). See edit below.) If you have a KB of memory for counts, you can do it in 4 passes; with 256 KB you can do it in 2 --- actually 128 KB since you could then use short ints for your counts. Here we're constrained to a few hundred bytes --- maybe 6 bits/6 passes/256 bytes?
EDIT: Li-aung Yip's solution scales better, and clearly it can be modified to partition by more than one bit in a pass. If the writing is slower than reading, then maybe the best solution would be a hybrid between this read-only answer and Li-aung Yip's disk based one.
Do a pass counting nibbles as above, then as you count the second set of nibbles, write only the numbers (or possibly just the last 28 bits of them) that match the first nibble, into 16 files according to the second nibble.
Pick your second nibble and read it to get counts for the third nibble, writing only the numbers matching the second nibble, etc. If the file is close to capacity, if the numbers are fairly uniformly distributed, or if you pick the least-full nibbles each time, you'll have to write no more than about 6.67% (1/16 + 1/16^2 + ... = 1/15) of the file size.

After repeated binary partitioning of your numbers into smaller and smaller files, you'll end up with:
a bunch of files that contain two numbers, which only differ in their least significant bit
one file with only one number in it.
Get the missing number by flipping the last bit of the number that's in a file by itself.
Take the example of the numbers from 0x00 to 0x07, missing 0x04:
000
001
010
011
... (missing)
101
110
111
Take 101, flip the least significant bit, and get 100, which is the missing 0x04.
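The repeated binary partitioning could be sketched in Java as follows. In-memory arrays stand in for the scratch files, and the bits parameter (my addition) lets the 3-bit example run as written. The sketch always keeps the smaller half, so it ends with an empty partition whose bit path is the missing number. It assumes the input has fewer than 2^bits entries and that all values fit in bits bits. BinaryPartition and findMissing are hypothetical names.

```java
final class BinaryPartition {
    /** Returns a value in [0, 2^bits) that does not occur in data. */
    static int findMissing(int[] data, int bits) {
        int[] current = data;                        // stands in for the current scratch file
        int missing = 0;
        for (int bit = bits - 1; bit >= 0; bit--) {
            int mask = 1 << bit;
            int onesCount = 0;
            for (int v : current) {
                if ((v & mask) != 0) onesCount++;
            }
            int zerosCount = current.length - onesCount;
            boolean keepOnes = onesCount < zerosCount;   // keep the smaller half
            int[] next = new int[keepOnes ? onesCount : zerosCount];
            int k = 0;
            for (int v : current) {
                if (((v & mask) != 0) == keepOnes) next[k++] = v;
            }
            if (keepOnes) missing |= mask;           // record which half was kept
            current = next;
        }
        return missing;
    }
}
```

On the numbers 0 to 7 without 4, the first split keeps {5, 6, 7}, the second keeps {5}, and the last keeps the empty half, which yields 100 (that is, 4).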

All 2^32 (about 4 billion) values are representable in a 32-bit integer. XORing a number with itself gives zero, which is a standard trick to zero out a register in assembly code.
If you are guaranteed that only one number is missing, bitwise XOR on integers comes to the rescue: an O(N) solution requiring only one additional 32-bit integer of space. XORing together all the numbers that are present leaves exactly the missing one, because the XOR of the complete range 0 to 2^32 - 1 is zero. Consider a simple example with 3-bit numbers, so the numbers 0-7 are representable and one of them is missing.
Assume 6 (110) is missing:
missing = n1 XOR n2 XOR n3 XOR .. XOR n7
= 000 XOR 001 XOR 010 XOR 011 XOR 100 XOR 101 XOR 111
= 110
Had the problem been to find the missing number between 1 and 100, you would need to start off by XORing out the elements that must be excluded. AND could be used to drop from the 32-bit integer range to smaller ranges by masking out bits in the number.
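A short Java sketch of the XOR approach for an arbitrary range [lo, hi]. It XORs every number that should be there with every number that is there, so everything cancels except the missing one. The array input is only for illustration (a real file would be streamed), and XorMissing and missing are hypothetical names.

```java
final class XorMissing {
    /**
     * values must contain every integer in [lo, hi] exactly once, except one.
     * Returns the absent integer.
     */
    static int missing(int lo, int hi, int[] values) {
        int acc = 0;
        for (long i = lo; i <= hi; i++) {   // long counter avoids overflow when hi is Integer.MAX_VALUE
            acc ^= (int) i;                 // everything that should be present
        }
        for (int v : values) {
            acc ^= v;                       // everything that is present cancels out
        }
        return acc;
    }
}
```

For the 3-bit example, missing(0, 7, new int[]{0, 1, 2, 3, 4, 5, 7}) returns 6.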
