Java - Space complexity with variable

Is the space complexity here O(n)? Since if k increases by 5, my variable p would also increase by 5.
All this method does right now is get the node at k. For example: 1->5->3, when k = 2, the node is 5.
public ListNode reverseKGroup(ListNode head, int k) {
    int p = 1;
    while (p < k) {
        if (head.next == null) {
            return head;
        }
        head = head.next;
        p++;
    }
    return head;
}

Strictly considering your algorithm, it has space complexity O(1). Your input is the head of a list and a number k, but your algorithm doesn't consume any space beyond the reference head and the number p. In my opinion, the pre-existing list doesn't count toward the space complexity of your method. Your time complexity, however, is O(N).
--- answer to Theo's question in the comment:
p is a number (in this case of primitive type int, so it takes 4 bytes - a constant size). If p increases, it doesn't take more space; a higher number is simply stored in those same bytes. E.g. p = 5 means the following bytes are set: "0,0,0,5"; for p = 257, the bytes are: "0,0,1,1".
I assume the JVM stores the data in big-endian byte order, so the leading zeros represent the more significant bytes. With little endian, the byte order would be reversed.
Of course, you are right that for very big N, a 32-bit integer is not enough. Therefore, strictly considering this fact, O(log(N)) bits are necessary to store numbers up to N.
E.g. the number 2^186 needs 187 bits to be stored (one 1 and 186 zeros).
But in reality, when working with "usual" data, you do not expect such a huge amount. Just to exceed a 32-bit register (one int), you would need 2^32 data entries (1 entry = 4 bytes for the next reference, at least 4 bytes for the value Object reference, and the object itself = at least 8 bytes), which is at least 2^36 bytes = 64 gigabytes. Therefore, when such a counter is used, it's generally considered to be constant space complexity. It depends on the task and circumstances.

Depending on whether you consider pre-existing structures part of your space complexity, the space complexity is either O(1) or O(N), where N is the length of the list being operated on, since you do not add any new nodes and only reference existing ones.
k only matters for time complexity.

The only space this algorithm uses is the space for int p, which is constant regardless of input so space complexity is O(1). The time complexity indeed is O(N).

Related

Why is this memoization faster with an array than with a map?

I was solving combination sum IV on leetcode (#377), which reads:
"Given an integer array with all positive numbers and no duplicates, find the number of possible combinations that add up to a positive integer target."
I solved it in Java using a top down recursive approach with a memoization array:
public int combinationSum4(int[] nums, int target) {
    int[] memo = new int[target + 1];
    for (int i = 1; i < target + 1; i++) {
        memo[i] = -1;
    }
    memo[0] = 1;
    return topDownCalc(nums, target, memo);
}
public static int topDownCalc(int[] nums, int target, int[] memo) {
    if (memo[target] >= 0) {
        return memo[target];
    }
    int tot = 0;
    for (int num : nums) {
        if (target - num >= 0) {
            tot += topDownCalc(nums, target - num, memo);
        }
    }
    memo[target] = tot;
    return tot;
}
Then I figured I was wasting time by initializing the entire memo array and could just use a Map instead (which would also save space / memory). So I rewrote the code as follows:
public int combinationSum4(int[] nums, int target) {
    Map<Integer, Integer> memo = new HashMap<Integer, Integer>();
    memo.put(0, 1);
    return topDownMapCalc(nums, target, memo);
}
public static int topDownMapCalc(int[] nums, int target, Map<Integer, Integer> memo) {
    if (memo.containsKey(target)) {
        return memo.get(target);
    }
    int tot = 0;
    for (int num : nums) {
        if (target - num >= 0) {
            tot += topDownMapCalc(nums, target - num, memo);
        }
    }
    memo.put(target, tot);
    return tot;
}
I am confused though, because after submitting the second version of my code Leetcode said it was slower and used more space than the first version. How does the HashMap use more space and run slower than an array whose values all had to be initialized and whose length is greater than the HashMap's size?
Then I figured I was wasting time by initializing the entire memo array
You could have stored 'answer + 1' instead, so that the default value (0) can serve as a placeholder for 'not calculated yet', and save that initialization. Not that it is expensive.
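For illustration, a minimal sketch of that trick (same method shape as the question's code; the memo stores answer + 1, so the default 0 means "not computed yet"):

public int combinationSum4(int[] nums, int target) {
    int[] memo = new int[target + 1];    // no initialization pass needed
    return topDownCalc(nums, target, memo);
}

private static int topDownCalc(int[] nums, int target, int[] memo) {
    if (target == 0) {
        return 1;                        // base case, replaces memo[0] = 1
    }
    if (memo[target] != 0) {
        return memo[target] - 1;         // stored value is answer + 1
    }
    int tot = 0;
    for (int num : nums) {
        if (target - num >= 0) {
            tot += topDownCalc(nums, target - num, memo);
        }
    }
    memo[target] = tot + 1;
    return tot;
}

Let's dig into cache pages.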
Cache pages
CPUs are complex beasts. They don't operate on memory directly; not anymore. They literally cannot do it; the chip's calculating parts are simply not hooked up to main memory at all. Instead, the CPU has caches, which come in set sizes (for example, 64k - you can't have a single cache node hold more or less than precisely 64k, and that entire 64k is then considered to be a cached copy of some 64k segment of main memory). One such node is called a cache page.
The CPU can only operate on cache pages.
In Java, an int[] is a single contiguous, fixed-size chunk of memory holding the data. In other words, int[] x = new int[1000] declares a single chunk of memory of 1000 * 4 = 4000 bytes (because ints are 4 bytes, and you reserved room for 1000 of them). That fits inside a single page. So, when you write your loop to initialize the values to -1, that's asking the CPU to loop through a single cache page and write some data to it. CPUs have pipelines and other speedup factors; this costs maybe 250 cycles.
Contrast to the cost of fetching a cache page: The CPU will be twiddling its thumbs (which is good; it can cool down some, and on modern hardware, often the CPU is limited not by its raw speed capabilities, but by the ability of the system to wick away the thermal impact of having it run! - it can also spend time on other threads/processes) whilst it farms out the job of fetching some chunk of memory into a cache page to the memory controller. Nevertheless, that thumb twiddling takes on the order of magnitude of 500 cycles or more. It's nice the CPU gets to cool down or focus on other things during it, but it's still the case that writing 4000 contiguous bytes in a tight loop is faster than a single cache miss.
Thus, 'fill a 1000-large int array with -1s' is an extremely cheap operation.
Wrapper objects
Maps operate on objects, not ints, which is why you had to write Integer and not int. An Integer, in memory at least, is a much, much larger load on memory. It's an entire object containing an int field. Then, your variable (or your map) holds a pointer to it.
So, an int[] x = new int[1000] takes 4000 bytes, plus some change for the object headers (maybe add 12 bytes to it all), and 1 reference (depends on VM, but let's say 64 bit), for a grand total of 4020 bytes.
In contrast,
Integer[] x = new Integer[1000];
for (int i = 0; i < 1000; i++) x[i] = i;
is much, much larger. It's 1000 pointers (can be as large as 8 bytes per pointer, or as small as 4. So, 4000 to 8000 bytes), to 1000 separate integer objects. Each integer object gets the object overhead (~12 bytes or more), + 1 integer field, generally word-aligned (so, 64-bits, even though it's only 32-bit, assuming a 64-bit VM running on 64-bit hardware, which is going to be the case on anything modern), for another 20000 bytes. A grand total of something closer to 30000 bytes.
That is about 8x more memory required.
Then consider that the 'key' in your memoized array is inherent (it's the index into the array) whereas in the map the key needs separate storage and it gets worse still: Each k/v pair in your map occupies at least 12+12+8+8+8+8 bytes (2 object overheads and 2 int fields for your key and value Integer objects, and 2 pointers for the map to point at these), 56 bytes. In contrast to your int[] which does it in 4.
That gives you a rate of 56/4 = 14.
If your map contains only 1 in 14 numbers, then the map should be about as large as your array, because the map can do a thing your array can't: The array is as large as it has to be from the get-go, the map only needs to store required nodes.
Still, one would assume for most 'interesting' inputs, the coverage factor of that map is going to be far north of 7.14%, thus resulting in the map being larger.
The map also has its objects smeared out over memory which risks them being in more than one cache page: Large memory load + fragmentation = an easy road to having the CPU wait for multiple cache page fetches vs. being able to do all the work in one go, never having to wait for cache misses.
Can it be faster?
Yeah, probably - but with map occupancy rates at 10% or higher, the concept of using a map to save space is dubious. If you want to try, you'd need a map specifically designed to hold ints and nothing else. These do exist, such as eclipse collections' IntIntMap.
But I bet in this case the simple array memoization strategy is just the winner, even if you use IntIntMap.
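If you do want to try the primitive-map route, a sketch might look like the following. I'm assuming Eclipse Collections' IntIntHashMap here (with put/get/containsKey taking plain ints); treat the exact API as something to verify against the library you pick.

import org.eclipse.collections.impl.map.mutable.primitive.IntIntHashMap;

public int combinationSum4(int[] nums, int target) {
    IntIntHashMap memo = new IntIntHashMap();
    memo.put(0, 1);
    return topDownMapCalc(nums, target, memo);
}

public static int topDownMapCalc(int[] nums, int target, IntIntHashMap memo) {
    if (memo.containsKey(target)) {
        return memo.get(target);         // no boxing, unlike HashMap<Integer, Integer>
    }
    int tot = 0;
    for (int num : nums) {
        if (target - num >= 0) {
            tot += topDownMapCalc(nums, target - num, memo);
        }
    }
    memo.put(target, tot);
    return tot;
}

Even so, per the discussion above, the plain int[] memo will likely still win on both time and memory for this problem.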
These things came to my mind first:
HashMap is what the name implies, a hash-based map. So whenever you put something into it or get something out of it, it has to hash the key, then find the target based on that hash.
put() operation isn't just a walk in the park, either - you can check here to get an idea what it does. Definitely more than array assignment.
in java it doesn't work with primitives, so for each value you have to convert ints to Integers and vice versa. (as noted by others, there are int-specialized map alternatives available, but not in standard lib)
And since you're not initializing it, it might need to resize internally several times during your run - the default capacity for a HashMap is just 16 - which is definitely more expensive than the one-shot initialization you did with the array; each resize allocates a bigger table and redistributes every existing entry into it.
it also works with Entry objects that it needs for each internal entry it's got, and all those objects also take some space, plenty more than just having an array of integers
So I wouldn't expect a HashMap to save you either space or time. Why would it?

Big-O Memory of Array vs String

This might sound dumb but I was wondering while thinking a bit, can't you play around with an algorithm and make O(n) memory seem O(1)?
(Java)
Let's say you have an array of N elements of true or false.
Then that array would result in O(n) memory.
However, if we have an array of say, "FFFFFTFTFFT" with each charAt(i) answering the result of the i-th index of the array, haven't we used only O(1) memory or is it considered O(n) memory since String is size of O(n) itself?
Let's take this further.
If we have an N-array of true and false and convert this to bytes, we use even less memory. Then is the byte also considered O(1) memory or O(n) memory?
For instance, let's say n = 6. Then the array size is 6 = O(n). But the byte size is just 1 byte, since 1 byte (8 bits) can store 8 true/false values. So is this O(1), or is this O(n), since for large N we get the following case:
N equals 10000. The array is O(n) memory, but what memory is the byte version? Because our byte count is O(n/8) = O(n)?
All the cases you've described are O(n). Big O describes the limiting behavior as n tends towards infinity; mathematically:
f(n) = O(n) as n -> INF is equivalent to f(n)/n -> const as n -> INF, where const != 0
So 10*n + 100 = O(n) and 0.1*n = O(n).
And as you wrote, the next statement is correct too: O(n/8) = O(n) = O(n/const)
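To make the constant-factor point concrete, here is a small sketch (my own code) of packing n true/false flags into a long[]: the storage is ceil(n / 64) words, which is still Θ(n) bits, just with a much smaller constant than boolean[].

class PackedBits {
    private final long[] words;

    PackedBits(int n) {
        words = new long[(n + 63) / 64];   // about n/64 longs for n flags
    }

    void set(int i, boolean value) {
        if (value) {
            words[i / 64] |= 1L << (i % 64);
        } else {
            words[i / 64] &= ~(1L << (i % 64));
        }
    }

    boolean get(int i) {
        return (words[i / 64] & (1L << (i % 64))) != 0;
    }
}

java.util.BitSet does essentially the same thing internally.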
I'm not sure you understand the concepts of Big O completely, but you still have N elements in each of the listed cases.
The notation O(N) is an upper bound for a function of N elements, not so much defined by the size of the underlying datatypes, since as noted O(N/8) = O(N).
So, example,
If we have an N-array of true and false and convert this to bytes
You are converting N elements into N bytes. This is O(N) time complexity. You stored 2 * O(N) total arrays, resulting in O(N) space complexity.
charAt(i)
This operation alone is O(1) time complexity because you are accessing one element. But you have N elements in an array or string, so it is O(N) space complexity
I'm not really sure there is a common O(1) space complexity algorithm (outside of simple math operations)
There is another misconception here: to truly make a "character container" with that O(1) property (or rather O(log n), since the required memory still grows with growing data), it would only work for exactly that: strings that contain n characters of one kind and 1 character of another kind.
In such cases, yes, you would only need to remember the index that has the different character. That is similar to defining a super sparse matrix: if only one value is != 0 in a huge matrix, you could store only the corresponding indexes instead of the whole matrix with gazillions of 0 values.
And of course: there are libraries that do such things for sparse matrixes to reduce the cost of keeping known 0 values in memory. Why remember something when you can (easily) compute it?!
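As a concrete (hypothetical) sketch of that special case: if the data is guaranteed to be n copies of one character with a single exception, you only need to store the length, the two characters and one index, no matter how large n gets.

class AlmostUniformString {
    private final int length;
    private final char fillChar;      // the character used everywhere else, e.g. 'F'
    private final int specialIndex;   // position of the single different character
    private final char specialChar;   // e.g. 'T'

    AlmostUniformString(int length, char fillChar, int specialIndex, char specialChar) {
        this.length = length;
        this.fillChar = fillChar;
        this.specialIndex = specialIndex;
        this.specialChar = specialChar;
    }

    char charAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        return i == specialIndex ? specialChar : fillChar;
    }
}

Strictly speaking even the index needs O(log n) bits, and the scheme collapses as soon as the data stops having exactly this shape.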

Why use 1<<4 instead of 16?

The OpenJDK code for java.util.HashMap includes the following line:
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
Why is 1 << 4 used here, and not 16? I'm curious.
Writing 1 << 4 instead of 16 doesn't change the behavior here. It's done to emphasize that the number is a power of two, and not a completely arbitrary choice. It thus reminds developers experimenting with different numbers that they should stick to the pattern (e.g., use 1 << 3 or 1 << 5, not 20) so they don't break all the methods which rely on it being a power of two. There is a comment just above:
/**
* The default initial capacity - MUST be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
No matter how big a java.util.HashMap grows, its table capacity (array length) is maintained as a power of two. This allows the use of a fast bitwise AND operation (&) to select the bucket index where an object is stored, as seen in methods that access the table:
final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) { // <-- bitwise 'AND' here
        ...
There, n is the table capacity, and (n - 1) & hash wraps the hash value to fit that range.
More detail
A hash table has an array of 'buckets' (HashMap calls them Node), where each bucket stores zero or more key-value pairs of the map.
Every time we get or put a key-value pair, we compute the hash of the key. The hash is some arbitrary (maybe huge) number. Then we compute a bucket index from the hash, to select where the object is stored.
Hash values bigger than the number of buckets are "wrapped around" to fit the table. For example, with a table capacity of 100 buckets, the hash values 5, 105, 205, would all be stored in bucket 5. Think of it like degrees around a circle, or hours on a clock face.
(Hashes can also be negative. A value of -95 could correspond to bucket 5, or 95, depending on how it was implemented. The exact formula doesn't matter, so long as it distributes hashes roughly evenly among the buckets.)
If our table capacity n were not a power of two, the formula for the bucket would be Math.abs(hash % n), which uses the modulo operator to calculate the remainder after division by n, and uses abs to fix negative values. That would work, but be slower.
Why slower? Imagine an example in decimal, where you have some random hash value 12,459,217, and an arbitrary table length of 1,234. It's not obvious that 12459217 % 1234 happens to be 753. It's a lot of long division. But if your table length is an exact power of ten, the result of 12459217 % 1000 is simply the last 3 digits: 217.
Written in binary, a power of two is a 1 followed by some number of 0s, so the equivalent trick is possible. For example, if the capacity n is decimal 16, that's binary 10000. So, n - 1 is binary 1111, and (n - 1) & hash keeps only the last bits of the hash corresponding to those 1s, zeroing the rest. This also zeroes the sign bit, so the result cannot be negative. The result is from 0 to n-1, inclusive. That's the bucket index.
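As a small concrete check of that arithmetic (my own example values):

int n = 16;                              // table capacity, a power of two
int hash = -95;                          // an arbitrary, possibly negative, hash

int byMask = (n - 1) & hash;             // 15 & 0xFFFFFFA1 = 1, never negative
int byFloorMod = Math.floorMod(hash, n); // also 1
int byAbsMod = Math.abs(hash % n);       // 15 - a different (but also valid) wrapping

System.out.println(byMask + " " + byFloorMod + " " + byAbsMod);  // prints "1 1 15"

The mask and the modulo formulas may pick different buckets for negative hashes, but each is a consistent wrapping scheme; the point is that the mask needs no division and cannot go negative.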
Even as CPUs get faster and their multimedia capabilities have improved, integer division is still one of the most expensive single-instruction operations you can do. It can be 50 times slower than a bitwise AND, and avoiding it in frequently executed loops can give real improvements.
I can't read the developer's mind, but we do things like that to indicate a relationship between the numbers.
Compare this:
int day = 86400;
vs
int day = 60 * 60 * 24; // 86400
The second example clearly shows the relationship between the numbers, and Java is smart enough to compile that as a constant.
I think the reason is that the developer can very easily change the value (according to the JavaDoc, '/* The default initial capacity - MUST be a power of two. */'), for example to 1 << 5 or 1 << 3, without needing to do any calculations.

Determining the element that occurred the most in O(n) time and O(1) space

Let me start off by saying that this is not a homework question. I am trying to design a cache whose eviction policy depends on entries that occurred the most in the cache. In software terms, assume we have an array with different elements and we just want to find the element that occurred the most. For example: {1,2,2,5,7,3,2,3} should return 2. Since I am working with hardware, the naive O(n^2) solution would require a tremendous hardware overhead. The smarter solution of using a hash table works well for software because the hash table size can change but in hardware, I will have a fixed size hash table, probably not that big, so collisions will lead to wrong decisions. My question is, in software, can we solve the above problem in O(n) time complexity and O(1) space?
There can't be an O(n) time, O(1) space solution, at least not for the generic case.
As amit points out, by solving this, we find the solution to the element distinctness problem (determining whether all the elements of a list are distinct), which has been proven to take Θ(n log n) time when not using elements to index the computer's memory. If we were to use elements to index the computer's memory, given an unbounded range of values, this requires at least Θ(n) space. Given the reduction of this problem to that one, the bounds for that problem enforces identical bounds on this problem.
However, practically speaking, the range would mostly be bounded, if for no other reason than the type one typically uses to store each element in has a fixed size (e.g. a 32-bit integer). If this is the case, this would allow for an O(n) time, O(1) space solution, albeit possibly too slow and using too much space due to the large constant factors involved (as the time and space complexity would depend on the range of values).
2 options:
Counting sort
Keeping an array of the number of occurrences of each element (the array index being the element), outputting the most frequent.
If you have a bounded range of values, this approach would be O(1) space (and O(n) time). But technically so would the hash table approach, so the constant factors here is presumably too large for this to be acceptable.
Related options are radix sort (has an in-place variant, similar to quicksort) and bucket sort.
Quicksort
Repeatedly partitioning the data based on a selected pivot (through swapping) and recursing on the partitions.
After sorting we can just iterate through the array, keeping track of the maximum number of consecutive elements.
This would take O(n log n) time and O(1) space.
As you say, the maximum element in your cache may be a very big number, but the following is one of the solutions.
Iterate over the array.
Let's say the maximum element that the array holds is m.
For each index i, get the element it holds; let it be array[i].
Now go to the index array[i] and add m to it.
Do the above for all the indexes in the array.
Finally, iterate over the array and return the index with the maximum element.
TC -> O(N)
SC -> O(1)
It may not be feasible for large m, as in your case. But see if you can optimize or alter this algorithm.
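A sketch of that idea in Java (my own code). Note the extra assumptions it needs: every value must be a usable index, i.e. 0 <= value < array length, and the counts must not overflow an int. I also use m = max + 1 rather than the maximum itself, so that % m recovers the original values unambiguously.

static int mostFrequent(int[] a) {
    int m = 0;
    for (int v : a) {
        m = Math.max(m, v);
    }
    m++;                                   // m is now strictly greater than every element

    for (int i = 0; i < a.length; i++) {
        int original = a[i] % m;           // recover the value even if a[i] was already bumped
        a[original] += m;                  // record one occurrence of 'original'
    }

    int best = 0;
    for (int i = 1; i < a.length; i++) {
        if (a[i] / m > a[best] / m) {      // a[i] / m = number of occurrences of value i
            best = i;
        }
    }
    return best;                           // the value (== index) that occurred the most
}

The input array is modified in place; a final pass doing a[i] %= m would restore it.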
A solution off the top of my head:
As the numbers can be large, I consider hashing instead of storing them directly in an array.
Let there be n numbers, 0 to n-1.
Suppose the number occurring the maximum number of times occurs K times.
Let us create n/k buckets, initially all empty.
hash(num) tells whether num is present in any of the buckets.
hash_2(num) stores the number of times num is present in any of the buckets.
for (i = 0 to n-1)
if the number is already present in one of the buckets, increase the count of input[i], something like Hash_2(input[i])++
else find an empty bucket and insert input[i] into the first empty bucket; Hash(input[i]) = true
else, if all buckets are full, decrease the count of all numbers in the buckets by 1 and don't add input[i] to any of the buckets.
If the count of any number becomes zero [see hash_2(number)], set Hash(number) = false.
This way, you will finally get at most k elements, and the required number is one of them, so you need to traverse the input again, O(N), to find the actual number.
The space used is O(K) and the time complexity is O(N), assuming the hash implementation is O(1).
So the performance really depends on K. If k << n, this method performs poorly.
I don't think this answers the question as stated in the title, but actually you can implement a cache with the Least-Frequently-Used eviction policy having constant average time for put, get and remove operations. If you maintain your data structure properly, there's no need to scan all items in order to find the item to evict.
The idea is having a hash table which maps keys to value records. A value record contains the value itself plus a reference to a "counter node". A counter node is a part of a doubly linked list, and consists of:
An access counter
The set of keys having this access count (as a hash set)
next pointer
prev pointer
The list is maintained such that it's always sorted by the access counter (where the head is the minimum), and the counter values are unique. A node with access counter C contains all keys having this access count. Note that this doesn't increase the overall space complexity of the data structure.
A get(K) operation involves promoting K by migrating it to another counter record (either a new one or the next one in the list).
An eviction operation triggered by a put operation roughly consists of checking the head of the list, removing an arbitrary key from its key set, and then removing it from the hash table.
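A compact sketch of that structure in Java (my own class and field names, not production code), showing the promote-on-get and the O(1) eviction:

import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

class LfuCache<K, V> {
    private static final class CounterNode<K> {
        int count;
        final LinkedHashSet<K> keys = new LinkedHashSet<>();
        CounterNode<K> prev, next;
        CounterNode(int count) { this.count = count; }
    }

    private static final class ValueRecord<K, V> {
        V value;
        CounterNode<K> node;   // the counter node this key currently belongs to
    }

    private final Map<K, ValueRecord<K, V>> table = new HashMap<>();
    private CounterNode<K> head;   // node with the smallest access count

    V get(K key) {
        ValueRecord<K, V> rec = table.get(key);
        if (rec == null) return null;
        promote(key, rec);                              // O(1) on average
        return rec.value;
    }

    void put(K key, V value) {
        ValueRecord<K, V> rec = table.get(key);
        if (rec != null) {
            rec.value = value;
            promote(key, rec);
            return;
        }
        // (a capacity check calling evictOne() would go here)
        if (head == null || head.count != 1) {          // ensure a count-1 node at the head
            CounterNode<K> one = new CounterNode<>(1);
            one.next = head;
            if (head != null) head.prev = one;
            head = one;
        }
        rec = new ValueRecord<>();
        rec.value = value;
        rec.node = head;
        head.keys.add(key);
        table.put(key, rec);
    }

    // Eviction: grab any key from the least-frequently-used counter node.
    K evictOne() {
        if (head == null) return null;
        K victim = head.keys.iterator().next();
        head.keys.remove(victim);
        table.remove(victim);
        if (head.keys.isEmpty()) unlink(head);
        return victim;
    }

    // Move a key from its current counter node to the node for count + 1.
    private void promote(K key, ValueRecord<K, V> rec) {
        CounterNode<K> cur = rec.node;
        CounterNode<K> next = cur.next;
        if (next == null || next.count != cur.count + 1) {
            next = new CounterNode<>(cur.count + 1);    // splice in a new node
            next.prev = cur;
            next.next = cur.next;
            if (cur.next != null) cur.next.prev = next;
            cur.next = next;
        }
        cur.keys.remove(key);
        next.keys.add(key);
        rec.node = next;
        if (cur.keys.isEmpty()) unlink(cur);
    }

    private void unlink(CounterNode<K> node) {
        if (node.prev != null) node.prev.next = node.next; else head = node.next;
        if (node.next != null) node.next.prev = node.prev;
        node.prev = node.next = null;
    }
}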
It is possible if we make reasonable (to me, anyway) assumptions about your data set.
As you say, you could do it if you could hash, because you can simply count-by-hash. The problem is that you may get non-unique hashes. You mention 20-bit numbers, so presumably 2^20 possible values and a desire for a small and fixed amount of working memory for the actual hash counts. This, one presumes, will therefore lead to hash collisions and thus a breakdown of the hashing algorithm. But you can fix this by doing more than one pass with complementary hashing algorithms.
Because these are memory addresses, it's likely not all of the bits are actually going to be capable of being set. For example, if you only ever allocate word (4 byte) chunks you can ignore the two least significant bits. I suspect, but don't know, that you're actually only dealing with larger allocation boundaries, so it may be even better than this.
Assuming word alignment, that means we have 18 bits to hash.
Next, you presumably have a maximum cache size which is presumably pretty small. I'm going to assume that you're allocating a maximum of <=256 items because then we can use a single byte for the count.
Okay, so to make our hashes we break up the number in the cache into two nine bit numbers, in order of significance highest to lowest and discard the last two bits as discussed above. Take the first of these chunks and use it as a hash to give a first part count. Then we take the second of these chunks and use it as a hash but this time we only count if the first part hash matches the one we identified as having the highest hash. The one left with the highest hash is now uniquely identified as having the highest count.
This runs in O(n) time and requires a 512 byte hash table for counting. If that's too large a table you could divide into three chunks and use a 64 byte table.
Added later
I've been thinking about this and I've realised it has a failure condition: if the first pass counts two groups as having the same number of elements, it cannot effectively distinguish between them. Oh well
Assumption: all the elements are integers; for other data types we can also achieve this by using hashCode().
We can achieve a time complexity of O(n log n) and space of O(1).
First, sort the array; the time complexity is O(n log n) (we should use an in-place sorting algorithm like quicksort in order to maintain the space complexity).
Use four integer variables: current, which indicates the value we are referring to; count, which indicates the number of occurrences of current; result, which indicates the final result; and resultCount, which indicates the number of occurrences of result.
Iterate from the start to the end of the array data:
int result = 0;
int resultCount = -1;
int current = data[0];
int count = 1;
for (int i = 1; i < data.length; i++) {
    if (data[i] == current) {
        count++;
    } else {
        if (count > resultCount) {
            result = current;
            resultCount = count;
        }
        current = data[i];
        count = 1;
    }
}
if (count > resultCount) {
    result = current;
    resultCount = count;
}
return result;
So, in the end, only 4 variables are used.

Searching a file for unknown integer with minimum memory requirement [duplicate]

I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4 × 10^9 × 4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4 × 10^9 distinct integers.
Case 1: we have 1 GB = 10^9 × 8 bits = 8 billion bits of memory.
Solution:
If we use one bit to represent each distinct integer, that is enough. We don't need to sort.
Implementation:
int radix = 8;
byte[] bitfield = new byte[1 << 29];   // 2^32 bits / 8 bits per byte = 512 MB

void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        // treat the value as unsigned so negative ints map to the upper half
        long n = in.nextInt() & 0xffffffffL;
        bitfield[(int) (n / radix)] |= (1 << (n % radix));
    }
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1 << j)) == 0) {
                System.out.println((long) i * radix + j);   // one answer is enough
                return;
            }
        }
    }
}
Case 2: 10 MB memory = 10 × 10^6 × 8 bits = 80 million bits
Solution:
For all possible 16-bit prefixes, there are 2^16 = 65536 of them, and we need 2^16 × 4 × 8 = 2 million bits of counters. We need to build 65536 buckets. For each bucket, we need 4 bytes to hold all possibilities, because the worst case is that all 4 billion integers belong to the same bucket.
1. Build the counter of each bucket through the first pass through the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. Build new buckets for the 16-bit prefix found in step 2 through a second pass of the file.
4. Scan the buckets built in step 3 and find the first bucket which doesn't have a hit.
The code is very similar to the one above; a sketch follows.
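A sketch of the two-pass version (my own code, following the steps above; it assumes the same Scanner-based input as Case 1):

int[] bucketCount = new int[1 << 16];           // 65536 counters = 256 KB

void findMissing() throws FileNotFoundException {
    // Pass 1: count how many integers fall into each 16-bit-prefix bucket.
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        bucketCount[in.nextInt() >>> 16]++;
    }
    int bucket = 0;
    while (bucketCount[bucket] >= (1 << 16)) {  // a bucket with a hole must exist
        bucket++;
    }

    // Pass 2: bit-map only the numbers that have that 16-bit prefix.
    boolean[] seen = new boolean[1 << 16];      // 64 KB (a real bit set would be 8 KB)
    in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        int n = in.nextInt();
        if ((n >>> 16) == bucket) {
            seen[n & 0xFFFF] = true;
        }
    }
    for (int low = 0; low < (1 << 16); low++) {
        if (!seen[low]) {
            System.out.println((bucket << 16) | low);
            return;
        }
    }
}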
Conclusion:
We decrease memory usage by increasing the number of passes through the file.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have be hit less than 216 times. Do a second pass to find of which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the largest number length of the longest number you've ever seen. When you're done, output the maximum plus one a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
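A sketch of that single-pass idea (my own code; the iteration source and the candidate count of 1000 are just illustrative):

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

static Integer findLikelyMissing(Iterable<Integer> fileInts) {
    Set<Integer> candidates = new HashSet<>();
    Random rnd = new Random();
    while (candidates.size() < 1000) {
        candidates.add(rnd.nextInt());      // 1000 distinct random 32-bit candidates
    }
    for (int n : fileInts) {
        candidates.remove(n);               // strike out every candidate that appears
    }
    // With ~4 billion known integers, the chance that all 1000 candidates were
    // struck out is vanishingly small; any survivor is not in the file.
    return candidates.isEmpty() ? null : candidates.iterator().next();
}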
A detailed discussion of this problem can be found in Jon Bentley, "Column 1: Cracking the Oyster," Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort and merge sort using several external files, but the best method Bentley suggests is a single-pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in:
4 billion bits = (4,000,000,000 / 8) bytes = 500,000,000 bytes, or about 0.466 GiB
The code to implement the bitset is simple (taken from the solutions page):
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.
If the available memory is less than 0.466 GiB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on the available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) were available and the range was from 0 to 31, we would divide this into the ranges 0-7, 8-15, 16-23 and 24-31, and handle each range in one of 32/8 = 4 passes.
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == a 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you're done, iterate over the bits and find the first one that is still '0'. Its index is the answer.
If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4 × 10^9 / 2^32). So if you create a bit-array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints and only record #s that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, then with overwhelming probability (it differs from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had something like 2^32 - 1 numbers and less than 10 MB of RAM, so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract the sum with one # missing from the full sum (1/2)(2^32)(2^32 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2^31 - 1) and second half (2^31, 2^32). Then pick the range with fewer numbers and repeat dividing that range in half. (Say if there were two fewer numbers in (2^31, 2^32), then your next search would count the numbers in the ranges (2^31, 3*2^30 - 1) and (3*2^30, 2^32).) Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4-byte (32-bit) integer). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, with 65536 numbers per bin. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535 = 2^16 - 1), bin 1 is (2^16 = 65536 to 2*2^16 - 1 = 131071), bin 2 is (2*2^16 = 131072 to 3*2^16 - 1 = 196607). In Python you'd have something like:
import numpy as np

nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1

for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break  # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list and count how many ints fall in each of the 2^16 bins, and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again, but this time only take notice when integers are in that range, flipping a bit when you find them.
del nums_in_bin  # allow gc to free the old 256 kB array

from bitarray import bitarray

my_bit_array = bitarray(65536)  # 8 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1

for i, bit in enumerate(my_bit_array):
    if not bit:
        print(bin_num * 65536 + i)
        break
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
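A sketch of that loop in Java (my own code; it treats the values as unsigned 32-bit integers held in longs, and the supplier simply re-opens the file for each scan):

import java.util.PrimitiveIterator;
import java.util.function.Supplier;
import java.util.stream.LongStream;

static long findMissing(Supplier<LongStream> file) {
    long lo = 0, hi = 4294967295L;                  // invariant: [lo, hi] contains a hole
    while (true) {
        long mid = lo + (hi - lo) / 2;
        long below = 0, equal = 0;                  // counts restricted to [lo, hi]
        PrimitiveIterator.OfLong it = file.get().iterator();
        while (it.hasNext()) {
            long v = it.nextLong();
            if (v < lo || v > hi) continue;
            if (v < mid) below++;
            else if (v == mid) equal++;
        }
        if (equal == 0) return mid;                 // mid never appears in the file
        if (below < mid - lo) hi = mid - 1;         // lower half has fewer values than slots
        else lo = mid + 1;                          // otherwise the upper half must have a hole
    }
}

(I pick the half that provably has fewer values than slots, which is the safe version of "fewest numbers" when duplicates are possible.)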
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer to not occur in the file is
((2^32 - 1)/2^32)^4,000,000,000 ≈ 0.4
The lower the Shannon entropy, the higher this probability gets on average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator and store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove that list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even in the worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10 -> 99%).
Memory consumption: a few dozen bytes; complexity: O(n); overhead: negligible, as most of the time will be spent on the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 6%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
They may be looking to see if you have heard of a probabilistic Bloom filter, which can very efficiently determine absolutely that a value is not part of a large set (but can only determine with high probability that it is a member of the set).
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 bytes = approx 0.5 GB.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
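A sketch of that in Java (my own code), with one BitSet per sign as suggested:

import java.util.BitSet;
import java.util.stream.IntStream;

static int findMissing(IntStream fileInts) {
    BitSet nonNegative = new BitSet(Integer.MAX_VALUE);  // bit n      <=> value n seen
    BitSet negative = new BitSet(Integer.MAX_VALUE);     // bit -(n+1) <=> value n (< 0) seen

    fileInts.forEach(n -> {
        if (n >= 0) nonNegative.set(n);
        else negative.set(-(n + 1));                     // -1 -> 0, -2 -> 1, ...
    });

    int missing = nonNegative.nextClearBit(0);
    if (missing >= 0) {
        return missing;                                  // a non-negative value is absent
    }
    // Every non-negative int was present (nextClearBit ran off the 2^31 end),
    // so some negative value must be absent instead.
    return -(negative.nextClearBit(0) + 1);
}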
If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since each bit is set an even number of times, an exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or of the rest will yield a number that, when exclusive-ored with the missing number, results in 0. Therefore, the missing number and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let's call the number of times each bit is set in n, x and the number of times each bit is set in n+1 y. The value of y will be equal to y = x * 2 because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32, two 32-bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;

while (supplied = read_int_from_file()) {
    result = result ^ supplied;
}

return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;

while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}

reset_file_pointer();

while (supplied = read_int_from_file()) {
    result = result ^ (supplied - offset);
}

return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;

while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}

reset_file_pointer();

while (supplied = read_int_from_file()) {
    n++;
    result = result ^ (supplied - offset);
}

// We need to increment n one value so that we take care of the missing
// int value
n++;

while (n == 1 || 0 != (n & (n - 1))) {
    result = result ^ (n++);
}

return result + offset;
Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
    $offset = min($array);
    $n = 0;
    $result = 0;
    foreach ($array as $value) {
        $result = $result ^ ($value - $offset);
        $n++;
    }
    $n++; // This takes care of the missing value
    while ($n == 1 || 0 != ($n & ($n - 1))) {
        $result = $result ^ ($n++);
    }
    return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
long supplied = 0;
long last = read_int_from_file();

while (supplied = read_int_from_file()) {
    if (supplied != last + 1) {
        return last + 1;
    }
    last = supplied;
}

// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
    ulong bitcount = filesize * 8; // number of bits in file
    for (ulong i = 0; i < bitcount; i++)
    {
        Console.Write(9);
    }
}
Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^bitcount - (4 × 10^9 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion-bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but by construction they don't.
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false otherwise (by comparing that certain integer with each integer in the file).
for (integer i = int_min; i <= int_max; ++i)
{
    if (isNotInFile(i)) {
        return i;
    }
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leaves will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion numbers ≈ 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible: if two end leaves have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is no need to build the tree completely too. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might work in fact.
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.

foreach (var number in file) {
    bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}

for (var i = 0; i < 4294967296; i++) {
    if (bitArray[i] == 0) {
        return i - 2147483648;
    }
}
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits; however, this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)

foreach long fileVal in file
{
    val = val & ~fileVal;
    if (val == 0) error;
}
Bit Counts
Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.
Range Logic
Keep track of a list of ordered ranges (ordered by start). A range is defined by the structure:
struct Range
{
    long Start, End; // Inclusive.
}

Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
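A sketch of that bookkeeping in Java (my own code), using a TreeMap keyed by range start; as noted, there is no hard memory guarantee, since the number of live ranges depends on how fragmented the input makes them:

import java.util.Map;
import java.util.TreeMap;

class RangeTracker {
    private final TreeMap<Long, Long> ranges = new TreeMap<>();  // start -> end (inclusive)

    RangeTracker(long start, long end) {
        ranges.put(start, end);
    }

    // Remove one value from whichever range contains it, splitting the range if needed.
    void remove(long value) {
        Map.Entry<Long, Long> e = ranges.floorEntry(value);
        if (e == null || e.getValue() < value) {
            return;                                   // value is not in any remaining range
        }
        long start = e.getKey(), end = e.getValue();
        ranges.remove(start);
        if (start < value) ranges.put(start, value - 1);
        if (value < end) ranges.put(value + 1, end);
    }

    // Any value still covered by some range never appeared in the input.
    Long anyUnseen() {
        return ranges.isEmpty() ? null : ranges.firstKey();
    }
}

Feed every value from the file through remove(), then anyUnseen() returns a number that was never seen (or null if every value in the initial range appeared).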
2^128 × 10^18 + 1 (which is (2^8)^16 × 10^18 + 1) - couldn't that be a universal answer for today? It represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);

# Print a '2' for every 4 bits of the file (still far more digits than the file could hold).
for (c = 0; c < sz; c++) {
    for (b = 0; b < 4; b++) {
        print "2";
    }
}
As Ryan basically said, sort the file, then go over the integers, and when a value is skipped, there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) = 4611686018.4 (roughly 1 in 4 billion). You'd be right most of the time!
(Joking... kind of.)
