Implementing a writeBit method in Java

So I know that in Java you cannot write out individual bits to a file and that you have to use writeByte. I have some understanding that there is a way to implement a writeBit method that makes use of writeByte by calling writeByte once 8 'bits' are concatenated together. I was hoping to implement this like:
public void writeBit(char bit) {
try {
//functionality here
} catch (IOException e) {
System.out.println(e);
}
}
But I just cannot seem to get started. I understand that I should probably have some attribute that keeps track of how many bits I have concatenated, but other than that I'm lost as to how to implement this.
I guess my big question here is how can I continuously call writeBit without losing my concatenated String of bits, and what would an implementation of writeBit look like, if it were to make use of writeByte?
As a side note, I am using a DataOutputStream here if this was not clear.

I've noticed a couple people have talked about using two instance variables, one to store the byte as you add bits to it and another to keep track of how many bits have been added so far. While this is a perfectly good way to do it, I'd like to show why you don't need a second instance variable.
Theory
There's no need to keep track of how many bits have been added so far. The only piece of information we need is "has the byte been filled yet?". Instead of initializing your byte to 0, try initializing it to a value of 1. Then each time you add a bit, shift the bits of the byte to the left one place (using the bitshift operator <<), and then add the new bit in the rightmost place.
In practice, it would look something like this, where X is the newly added bit:
Initialize the byte to a value of 1: 00000001
To insert a new bit, shift the bits to the left: 00000010
And add the new bit X in the rightmost place: 0000001X
Shift left: 000001X0
Add the new bit: 000001XX
Eventually, you'd have 7 bits written from your method and the leftmost bit would be 1: 1XXXXXXX
So in your method, you can check to see if the leftmost bit is set every time it's called. If it is, then you know you're ready to write the byte to the file on this iteration. You would start by doing the same thing, shifting left and then adding the new bit, so now you have XXXXXXXX. Then you would write the now-full byte to the file, and then reset the byte to a value of 1 so the cycle can start over again.
Writing the code
First you'll need an instance variable to keep track of these bits. It will need to be type byte, and I'll just call it buffer.
To shift the bits to the left one place, we can use the bitshift operator, <<. And, to make our lives even easier, there's even a bitshift assignment operator, <<=, so we can perform the bitshift and assign the new value back to the variable all in one operation. This leaves us with:
buffer <<= 1;
The next thing we'll need to do is add the new bit. If you OR a value with 1, the rightmost bit will be set, and the rest of the bits will be unaffected. If you OR a value with 0, none of the bits are affected. We can use this trick to only set the rightmost bit if the new bit is a 1 (The |= is the OR assignment operator):
buffer |= bit ? 1 : 0;
Then, the last piece of this code is writing the if statement to check if the leftmost bit is set. If it is, then when we AND it with 10000000, we will get 10000000. If not, we will get 00000000. 10000000 is 128 in decimal (or -128 when read as a signed byte; either will work as the mask), so our expression is:
(buffer & 128) == 128
Result
Putting all these pieces together, we get:
// Notice bit is type boolean
public void writeBit(boolean bit) {
// If the leftmost bit in buffer is set:
if ((buffer & 128) == 128) {
// Shift all the bits in buffer to the left 1 place
buffer <<= 1;
// Add the new bit in the rightmost place
buffer |= bit ? 1 : 0;
// Write the now-full byte to the file
// I'm just calling your DataOutputStream "dos" here
try {
dos.writeByte(buffer);
} catch (IOException e) {
// Keep the original exception as the cause so the failure is not hidden
throw new RuntimeException(e);
}
// Reset buffer to a value of 1
buffer = 1;
} else {
// Shift all the bits in buffer to the left 1 place
buffer <<= 1;
// Add the new bit in the rightmost place
buffer |= bit ? 1 : 0;
}
}

Make a class with two instance variables, the first with the bits that you have accumulated so far, and the second with how many bits you have accumulated. Use the shift and or operations to insert a bit into the buffer (initialized to zero), and increment the number of bits. Once you have eight bits, write the buffer, then zero out the buffer and the count.
At the end you will need to flush any remaining bits if the count is not zero by writing the buffer to a byte, even though it contains less than eight bits. The format of the sequence of bits needs to be able to deal with this eventuality, unless it assures that a multiple of eight bits is always written.
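A minimal sketch of that two-variable approach, assuming a wrapper class around the DataOutputStream from the question. The names BitWriter, flush and encode are hypothetical, and padding the final byte with zero bits on the right is only one possible convention:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper; the class and method names are not from the posts above.
final class BitWriter {
    private final DataOutputStream out;
    private int buffer = 0; // bits accumulated so far, newest bit in the lowest position
    private int count = 0;  // how many bits buffer currently holds (0..7)

    BitWriter(DataOutputStream out) {
        this.out = out;
    }

    void writeBit(boolean bit) throws IOException {
        buffer = (buffer << 1) | (bit ? 1 : 0);
        count++;
        if (count == 8) {
            out.writeByte(buffer); // writeByte keeps only the low eight bits
            buffer = 0;
            count = 0;
        }
    }

    // Assumed convention: pad the last partial byte with zero bits on the right.
    void flush() throws IOException {
        if (count > 0) {
            out.writeByte(buffer << (8 - count));
            buffer = 0;
            count = 0;
        }
        out.flush();
    }

    // Demonstration helper: encodes a string of '0'/'1' characters in memory.
    static byte[] encode(String bits) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BitWriter writer = new BitWriter(new DataOutputStream(bytes));
        try {
            for (char c : bits.toCharArray()) {
                writer.writeBit(c == '1');
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With this sketch, BitWriter.encode("101100001") produces the two bytes 0xB0 and 0x80: the first eight bits fill one byte and the ninth is padded out by flush. A reader of the file still needs some other signal, such as a length header, to know that only one bit of the last byte is meaningful.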

Related

Why is this algorithm using the bitwise and operator in Java?

I see an algorithm such as:
for (int i = 1; i < sums.length; i++) {
/*
* dynamic programming:
* 1. remove a single block from the current subset of blocks
* 2. the corresponding block sum was already calculated
* 3. add the number on the removed block to it
*
* here: always choose the block corresponding
* to the least significant bit of i
*/
int t = Integer.numberOfTrailingZeros(i & -i);
sums[i] = sums[i & ~(1 << t)] + block[t];
//only add block subsets that add up to a face
if (masks.containsKey(sums[i]))
masks.get(sums[i]).add(i);
}
According to the comments, in this line (int t = Integer.numberOfTrailingZeros(i & -i);) the author means to choose an element in the block array according to the least significant bit of the number i.
Why is the author using a bitwise and operator on i and -i? Couldn't they just use i (e.g., Integer.numberOfTrailingZeros(i))?
Here is the larger body of code for context: http://pastebin.com/YB9wsdgD
Yes, you can do that. A few plausible explanations:
The code was ported from something that had a home-brew version of the function, and the author left that trick in.
The author didn't realize that Integer.numberOfTrailingZeros() doesn't care whether other, more significant bits are set, and knew how to quickly get a value with just the least significant set bit.
Or there was originally a home-brew version that took i & -i and bitshifted it until the value was zero, setting t to the number of shifts, and somebody replaced that loop with the built-in without realizing that i & -i was only there to make the deleted loop work. Checking against zero is faster than other checks, though it doesn't matter much anymore, so some hyper-optimizing people would very often arrange things to check against zero, especially if they didn't have to subtract.
for (var i = 0; i < 1000; i++) {
document.write(i & -i);
document.write('<br>');
}
In two's complement, -i is formed by flipping all the bits of i and adding one. The carry from that addition ripples through the trailing zeros and stops at the lowest set bit, so i & -i keeps only the lowest set bit of i. You can then determine how many zeros follow that 1 in the binary representation, generally by bitshifting right and counting the shifts; because only one bit is set, that loop can simply run until the value is zero. Integer's version of that code doesn't need the trick, though.
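A small check of that equivalence, as a sketch. The class and method names are made up for illustration, and trailingZerosByShifting is only a guess at what a home-brew loop might have looked like:

```java
final class LowestSetBit {
    // Hypothetical home-brew version: isolate the lowest set bit, then shift until it is gone.
    static int trailingZerosByShifting(int i) {
        if (i == 0) {
            return 32; // same convention as Integer.numberOfTrailingZeros(0)
        }
        int lowest = i & -i; // exactly one bit set
        int t = 0;
        while ((lowest >>>= 1) != 0) {
            t++;
        }
        return t;
    }

    // True if masking with i & -i never changes the built-in's answer for 1..limit.
    static boolean maskIsRedundant(int limit) {
        for (int i = 1; i <= limit; i++) {
            int direct = Integer.numberOfTrailingZeros(i);
            if (direct != Integer.numberOfTrailingZeros(i & -i)
                    || direct != trailingZerosByShifting(i)) {
                return false;
            }
        }
        return true;
    }
}
```

For i = 12 (binary 1100), i & -i is 4 (binary 0100), and both forms report two trailing zeros.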

Varying value bit length array

I have recently come across the problem of creating arrays with values that have a specified bit length. Say an array with 13bits instead of 8,16,32 etc. I tried to look for a good tutorial/article about it as I am new to bit operations. Though I am not really sure of what to search for. I presume the array would work with a backing array of bytes or longs...
My ultimate question is if you can show me if there is a duplicate question or tutorial out there.
If not perhaps show me an example. AND if you got the time write a short explanation.
Thank you.
EDIT: The purpose is not to make an array of, say, longs but only use 40% of each one. I want the values packed together to save space and to be compatible with the thing I'm making.
It's not possible to "create your own primitive types" in Java. Also, I don't think there is any library out there that does what you want. I think most people would accept the overhead of losing some memory, especially at the bit level. Maybe C or C++ would have been a wiser choice (and I'm not even sure).
You'll have to create your own bit manipulation library. There are many ways to do it; I'll give you one. I began using a byte[] but it's more complex. As a rule, use the biggest normal type that is no wider than your element (e.g. for 48-bit elements, use 32-bit types as storage). So let's go with 16-bit storage units for 100 of your 13-bit values. Note that a Java int is actually 32 bits, so the int array below uses only the low 16 bits of each element; a char[] would be the compact choice. I'll use big-endian-style storage.
int intArraySize = 100 * 13 / 16 + 1; // total bits / bits per unit, + 1 to round up
int[] intArray = new int[intArraySize];
Now, how do you access the value at index 6, for example? You'll need at most two units of your array and an integer to store the result.
int pos = 6;
int buffer = 0;
int firstPart = intArray[(pos * 13) / 16]; // 1010 0110 1100 0011
int secondPart = intArray[(pos * 13) / 16 + 1]; // 1001 1110 0101 1111
int begin = pos * 13 % 16;
The variable begin = 14 is the bit offset at which your number begins inside firstPart. So that means that, of your 13-bit element, 16 - 14 = 2 bits are in the first (left) unit and the rest (13 - 2 = 11) are in the second (right) unit.
The number you want is 1010 0110 1100 00{11 and 1001 1110 010}1 1111.
You're going to put these two parts into one now. Right shift secondPart by 5 (16 - 11, so its top 11 bits become the right part of your final number), left shift firstPart by 11, and OR them into the buffer. Because it's a 13-bit element, you'll need to clear (with a bitmask such as 0x1FFF) everything above the low 13 bits of the buffer, and voilà!
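The read step can be sketched as follows, assuming 16-bit units stored in a char[] (unsigned in Java) rather than an int[]. The class name PackedRead and its constants are hypothetical. For index 6 the element starts at bit 14 of its unit, so 2 bits come from the first unit and 11 from the second; instead of shifting the two parts separately, this version joins them into one 32-bit window and shifts once:

```java
final class PackedRead {
    static final int ELEMENT_BITS = 13; // width of each stored value
    static final int UNIT_BITS = 16;    // width of each storage unit

    // Reads the 13-bit element at index pos from big-endian-style packed units.
    static int get(char[] units, int pos) {
        int bitIndex = pos * ELEMENT_BITS;
        int unit = bitIndex / UNIT_BITS;
        int begin = bitIndex % UNIT_BITS; // offset of the element inside its first unit

        int first = units[unit];
        // The element may end inside the first unit; treat a missing neighbour as zero.
        int second = unit + 1 < units.length ? units[unit + 1] : 0;

        // Join both units into one 32-bit window, then slide the element to the bottom.
        int window = (first << UNIT_BITS) | second;
        int shift = 2 * UNIT_BITS - begin - ELEMENT_BITS;
        return (window >>> shift) & ((1 << ELEMENT_BITS) - 1);
    }
}
```

With the two example units 1010 0110 1100 0011 and 1001 1110 0101 1111 at indexes 4 and 5, get(units, 6) returns 1 1100 1111 0010 in binary (0x1CF2). Writing a value is left as the exercise the answer suggests.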
I'll let you guess how to insert a value in the array (try doing the same steps, but in reverse), and be careful not to erase other values. And if you haven't looked yet: https://docs.oracle.com/javase/tutorial/java/nutsandbolts/op3.html
Disclaimer: I didn't try the code, so there might be some errors; maybe you'll have to add or remove 1 to begin. But you get the general idea. The first thing you should do is make a function that prints or logs any integer (or byte, or whatever) in its binary representation. There are multiple possibilities here (see "Print an integer in binary format in Java"), because you're going to need it to test every step of your code.
I still think it's a bad idea to store your special numbers this way (seriously, memory is rarely going to be an issue), but I found the exercise interesting, and maybe you really need that kind of storage. If you're curious, take a look at ByteArrayOutputStream; I'm not sure you'll ever need it for what you're doing, but who knows.

How do I combine two AudioInputStream?

The file format is "PCM_SIGNED 44100.0 Hz, 16 bit, stereo, 4 bytes/frame, little-endian", and I want to add two files together while changing the volume of one of them. I plan to read the two WAV files into two AudioInputStream instances, copy their contents into two byte[] arrays, manipulate the arrays, and return the result as another AudioInputStream instance.
I have done a lot of research but have found no good results.
I know there is a class from www.jsresources.org for mixing two AudioInputStreams, but it doesn't allow me to modify either of the two streams before mixing, and I want to decrease the volume of one of the streams before mixing them. What do you think I should do?
To do this, you can convert the streams to PCM data, multiply the channel whose volume you wish to change by the desired factor, add the PCM data from the results together, then convert back to bytes.
To access the AudioStreams on a per-byte basis, check out the first extended code fragment at the Java Tutorials section on Using Files and Format Converters. This shows how to get an array of sound byte data. There is a comment that reads:
// Here, do something useful with the audio data that's
// now in the audioBytes array...
At this point, iterate through the bytes, converting to PCM. A set of commands based on the following should work:
for (int i = 0; i < numBytes; i += 2)
{
    // low byte is unsigned, high byte carries the sign (little-endian)
    pcmA[i/2] = (audioBytesA[i] & 0xff) | (audioBytesA[i + 1] << 8);
    pcmB[i/2] = (audioBytesB[i] & 0xff) | (audioBytesB[i + 1] << 8);
}
In the above, audioBytesA and audioBytesB are the byte arrays read from the two input streams (names based on the code from the example), and pcmA and pcmB could be either int arrays or short arrays, holding values that fit within the range of a short. It might be best to make the pcm arrays floats since you will be doing some math that will result in fractions. Using floats as in the example below only adds one place worth of accuracy (better rounding than when using int), and int would perform faster. I think using floats is more often done if the audio data gets normalized for use with additional processing.
From there, the best way to change volume is to multiply every PCM value by the same amount. For example, to increase volume by 25%,
pcmA[i] = pcmA[i] * 1.25f;
Then, add pcmA and pcmB, and convert back to bytes. You might also want to put in min or max functions to ensure that the volume & merging do not exceed values that can fit in the format's 16 bits.
I use the following to convert back to bytes:
// numBytes counts bytes, so there are numBytes / 2 samples to write back
for (int i = 0; i < numBytes / 2; i++)
{
    outBuffer[i*2] = (byte) ((int) pcmCombined[i]);
    outBuffer[(i*2) + 1] = (byte) ((int) pcmCombined[i] >> 8);
}
Above assumes pcmCombined[] is a float array. The conversion code can be a bit simpler if it is a short[] or int[] array.
I cut and pasted the above from dev work I did for programs posted at my website, and edited it for your scenario, so if there is a typo or bug crept in, please let me know in the comments and I will fix it.
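Putting the decode, scale, add, clamp and encode steps together, a minimal sketch might look like this. It assumes both buffers already hold raw 16-bit little-endian signed PCM in the same format; the class name PcmMixer and the gainA parameter are hypothetical, and it works on ints instead of a float array:

```java
final class PcmMixer {
    // Mixes two buffers of 16-bit little-endian signed PCM, scaling the first by gainA.
    static byte[] mix(byte[] a, byte[] b, float gainA) {
        // Use the shorter buffer and drop a trailing odd byte so we only touch whole samples.
        int numBytes = Math.min(a.length, b.length) & ~1;
        byte[] out = new byte[numBytes];
        for (int i = 0; i < numBytes; i += 2) {
            // Low byte is masked to stay unsigned; high byte keeps its sign.
            int sampleA = (a[i] & 0xff) | (a[i + 1] << 8);
            int sampleB = (b[i] & 0xff) | (b[i + 1] << 8);

            int mixed = Math.round(sampleA * gainA) + sampleB;
            // Clamp so the sum still fits in a signed 16-bit sample.
            mixed = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, mixed));

            out[i] = (byte) mixed;
            out[i + 1] = (byte) (mixed >> 8);
        }
        return out;
    }
}
```

Because stereo frames are just interleaved left and right samples, treating every 16-bit sample the same way handles both channels. Clamping avoids wrap-around, but heavily clipped output will still sound distorted, so a gain below 1 for one or both inputs is usually the safer choice.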

Searching a file for unknown integer with minimum memory requirement [duplicate]

I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×10^9×4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4.3×10^9 distinct integers.
Case 1: we have 1 GB = 1 × 10^9 × 8 bits = 8 billion bits of memory.
Solution:
If we use one bit to represent each distinct integer, that is enough. We don't need to sort.
Implementation:
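import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

int radix = 8;
// One bit per possible 32-bit value: 2^32 / 8 = 2^29 bytes.
// (0xffffffff is the int -1 in Java, so 0xffffffff/radix would give size 0.)
byte[] bitfield = new byte[(int) (0x100000000L / radix)];
void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        // Treat the int as unsigned so the index is never negative
        long n = in.nextInt() & 0xFFFFFFFFL;
        bitfield[(int) (n / radix)] |= (1 << (int) (n % radix));
    }
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1 << j)) == 0) {
                // long arithmetic: i * radix overflows int for large i
                System.out.println((long) i * radix + j);
                return; // one missing integer is enough
            }
        }
    }
}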
Case 2: 10 MB memory = 10 × 10^6 × 8 bits = 80 million bits
Solution:
There are 2^16 = 65536 possible 16-bit prefixes, so we build 65536 buckets, which takes 2^16 × 4 × 8 ≈ 2 million bits. Each bucket needs a 4-byte counter because in the worst case all 4 billion integers belong to the same bucket.
1. Build the counter of each bucket in a first pass through the file.
2. Scan the buckets and find the first one that has fewer than 65536 hits.
3. Build new buckets for the integers whose high 16-bit prefix is the one found in step 2, in a second pass through the file.
4. Scan the buckets built in step 3 and find the first bucket that doesn't have a hit.
The code is very similar to above one.
Conclusion:
We decrease memory through increasing file pass.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes, in one pass through the input file. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the length of the longest number you've seen. When you're done, output a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it.)
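The 32-bit case (count per 16-bit prefix, then a second pass over one bucket) can be sketched as below. The Supplier<IntStream> parameter is a hypothetical stand-in for re-reading the file, values are treated as unsigned, and the class name is made up:

```java
import java.util.BitSet;
import java.util.function.Supplier;
import java.util.stream.IntStream;

final class TwoPassMissing {
    // 'source' must yield the same numbers on every call, like re-opening the file.
    static long findMissing(Supplier<IntStream> source) {
        // Pass 1: count inputs per 16-bit prefix (65536 int counters, 256 KiB).
        int[] counts = new int[1 << 16];
        source.get().forEach(n -> counts[n >>> 16]++);

        int bucket = -1;
        for (int i = 0; i < counts.length; i++) {
            // Fewer than 65536 hits guarantees a gap, even with duplicates.
            // Unsigned comparison: a counter may exceed Integer.MAX_VALUE.
            if (Integer.compareUnsigned(counts[i], 1 << 16) < 0) {
                bucket = i;
                break;
            }
        }
        if (bucket < 0) {
            throw new IllegalStateException("no bucket has a guaranteed gap");
        }

        // Pass 2: record which low 16-bit suffixes occur inside that bucket (8 KiB).
        final int prefix = bucket;
        BitSet seen = new BitSet(1 << 16);
        source.get()
              .filter(n -> (n >>> 16) == prefix)
              .forEach(n -> seen.set(n & 0xFFFF));

        int suffix = seen.nextClearBit(0);
        return ((long) prefix << 16) | suffix; // unsigned 32-bit value
    }
}
```

For the input 0, 1, 2, 65536 the first bucket has only three hits, so the second pass looks at suffixes 0, 1 and 2 and reports 3.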
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
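A sketch of the candidate-elimination idea for the 32-bit case. The class name, the seed parameter and the use of an IntStream in place of the file are assumptions for illustration; the caller should retry with fresh candidates if nothing survives:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.stream.IntStream;

final class RandomCandidates {
    // Returns the random candidates that never appeared in 'numbers' (possibly empty).
    static Set<Integer> survivors(IntStream numbers, int candidateCount, long seed) {
        Random random = new Random(seed);
        Set<Integer> candidates = new HashSet<>();
        while (candidates.size() < candidateCount) {
            candidates.add(random.nextInt());
        }
        // Single pass: drop every candidate that shows up in the input.
        numbers.forEach(n -> candidates.remove(Integer.valueOf(n)));
        return candidates;
    }
}
```

Any element of the returned set is an integer that is not in the input. With 1000 candidates the set costs a few tens of kilobytes on a typical JVM, far below the 10 MB limit.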
This problem is discussed in detail in Jon Bentley, "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort, merge sort using several external files, etc., but the best method Bentley suggests is a single-pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple: (taken from solutions page )
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array, and then examines this array using the test function above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) was available and the range was from 0 to 31, we divide this into ranges of 0-7, 8-15, 16-23 and 24-31, and handle one range in each of 32/8 = 4 passes.
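The k-pass idea can be sketched like this, with one pass per window of values. This is not Bentley's code: the class name, the Supplier<IntStream> used to re-read the input, and the parameters are assumptions for illustration:

```java
import java.util.BitSet;
import java.util.function.Supplier;
import java.util.stream.IntStream;

final class RangePasses {
    // Looks for a missing value in [0, rangeSize), using only windowSize bits per pass.
    // Returns -1 if every value in the range is present.
    static long findMissing(Supplier<IntStream> source, long rangeSize, int windowSize) {
        for (long start = 0; start < rangeSize; start += windowSize) {
            final long base = start;
            final long end = Math.min(start + windowSize, rangeSize);

            // One pass: mark the values that fall inside the current window.
            BitSet seen = new BitSet(windowSize);
            source.get()
                  .mapToLong(n -> n & 0xFFFFFFFFL) // read each int as unsigned
                  .filter(v -> v >= base && v < end)
                  .forEach(v -> seen.set((int) (v - base)));

            int clear = seen.nextClearBit(0);
            if (clear < end - base) {
                return base + clear;
            }
        }
        return -1;
    }
}
```

With the range 0 to 31, a window of 8 and the value 19 absent from the input, the third pass (16-23) finds the gap.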
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you're done, iterate over the bits and find the first one that is still '0'. Its index is the answer.
If they are 32-bit integers (likely from the choice of ~4 billion numbers, close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4 × 10^9 / 2^32). So you can create a bit-array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1, and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints and only record numbers that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, with overwhelming probability (it differs from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had something like 2^32 - 1 numbers and less than 10 MB of RAM, so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract that sum (with one number missing) from the full sum (½)(2^32)(2^32 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2^31 - 1) and the second half (2^31 to 2^32 - 1). Then pick the range with fewer numbers and repeat, dividing that range in half. (Say there were fewer numbers in (2^31 to 2^32 - 1); then your next search would count the numbers in the ranges (2^31 to 3×2^30 - 1) and (3×2^30 to 2^32 - 1).) Keep repeating until you find a range with zero numbers and you have your answer. This should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4-byte (32-bit) integer). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, each covering 65536 numbers. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535 = 2^16 - 1), bin 1 is (2^16 = 65536 to 2×2^16 - 1 = 131071), bin 2 is (2×2^16 = 131072 to 3×2^16 - 1 = 196607). In Python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break  # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list, count how many ints fall in each of the 2^16 bins, and find an incomplete_bin that doesn't have all 65536 numbers. Then read through the 4 billion integer list again, but this time only notice when integers are in that range, setting a bit when you find them.
del nums_in_bin  # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536)  # 65536 bits = 8 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print(bin_num * 65536 + i)
        break
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 64k.
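A sketch of the halving search, with two changes from the steps above that should be read as assumptions of this example rather than the answer's exact method: it only counts the lower half on each pass, and it descends into a half whose count is smaller than the number of values that half can hold, which stays correct when the file contains duplicates. It assumes the file holds fewer than 2^32 numbers; the class name and the Supplier<IntStream> used to re-read the input are hypothetical:

```java
import java.util.function.Supplier;
import java.util.stream.IntStream;

final class BisectMissing {
    // Returns an unsigned 32-bit value that does not occur in the input.
    static long findMissing(Supplier<IntStream> source) {
        long lo = 0;
        long hi = 0xFFFFFFFFL; // inclusive bounds of the current range
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            final long low = lo;
            final long middle = mid;

            // One pass: how many inputs fall in the lower half [lo, mid]?
            long inLower = source.get()
                                 .mapToLong(n -> n & 0xFFFFFFFFL)
                                 .filter(v -> v >= low && v <= middle)
                                 .count();

            if (inLower < mid - lo + 1) {
                hi = mid;     // the lower half cannot be full
            } else {
                lo = mid + 1; // lower half may be full, so the gap is in the upper half
            }
        }
        return lo;
    }
}
```

Each pass keeps only a handful of long variables, and the loop runs 32 times for the full 32-bit range.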
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer to not occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes, complexity: O(n), overhead: negligible, as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 7%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 = approx 0.5 GB.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
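A sketch of the two-BitSet variant, using an IntStream as a stand-in for reading the file. The class name is made up, and the handling of a completely full BitSet relies on an assumption about how nextClearBit behaves at the size limit, as noted in the comments:

```java
import java.util.BitSet;
import java.util.stream.IntStream;

final class BitSetMissing {
    // Returns a signed int that does not occur in 'numbers'.
    static int findMissing(IntStream numbers) {
        BitSet nonNegative = new BitSet(); // bit n set   <=> n was seen, for n >= 0
        BitSet negative = new BitSet();    // bit ~n set  <=> n was seen, for n < 0
        numbers.forEach(n -> {
            if (n >= 0) {
                nonNegative.set(n);
            } else {
                negative.set(~n); // ~n maps -1..Integer.MIN_VALUE onto 0..Integer.MAX_VALUE
            }
        });

        int missing = nonNegative.nextClearBit(0);
        if (missing >= 0) {
            return missing;
        }
        // Assumption: a negative index here means all 2^31 non-negative bits were set
        // and the returned index overflowed, so the gap must be among the negatives.
        return ~negative.nextClearBit(0);
    }
}
```

The BitSets grow on demand, so memory use depends on the largest values seen; in the worst case each one needs 2^31 bits (256 MB).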
If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the range of numbers will always be 2^n (an exact power of 2), then exclusive-or will work (as shown by another poster). As for why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since it is set an even number of times, an exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that when exclusive-ored with the missing number will result in 0. Therefore, the missing number, and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let's call the number of times each bit is set in n, x and the number of times each bit is set in n+1 y. The value of y will be equal to y = x * 2 because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32, 2 32 bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
result = result ^ supplied;
}
return result;
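In Java the same single pass can be sketched with an IntStream standing in for the file (the class name is hypothetical). It assumes the input contains every value of a 0-based range of size 2^n, n >= 2, exactly once, except for one missing value. Unlike a loop that stops when the read call returns 0, it keeps going when the value 0 appears in the data:

```java
import java.util.stream.IntStream;

final class XorMissing {
    // XOR of all supplied values; equals the missing value under the stated assumptions.
    static int findMissing(IntStream numbers) {
        return numbers.reduce(0, (acc, n) -> acc ^ n);
    }
}
```

For example, the values 0, 1, 3 give 2, and 0, 1, 2, 3, 4, 6, 7 give 5.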
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
result = result ^ (supplied - offset);
}
return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing number. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
if (supplied < offset) {
offset = supplied;
}
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
n++;
result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing
// int value
n++;
while (n == 1 || 0 != (n & (n - 1))) {
result = result ^ (n++);
}
return result + offset;
Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
$offset = min($array);
$n = 0;
$result = 0;
foreach ($array as $value) {
$result = $result ^ ($value - $offset);
$n++;
}
$n++; // This takes care of the missing value
while ($n == 1 || 0 != ($n & ($n - 1))) {
$result = $result ^ ($n++);
}
return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
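A Java rendition of the same routine might look like the sketch below. It again uses an array in place of the file, and the class and method names are hypothetical. Like the PHP version, it assumes exactly one value is missing and that it lies strictly between the smallest and largest values present.

```java
class ArbitraryRangeXor {
    // Assumes exactly one value is missing and that it lies strictly
    // between the smallest and largest values present.
    static long findMissing(long[] values) {
        long offset = Long.MAX_VALUE;
        for (long v : values) {              // pass 1: find the minimum
            offset = Math.min(offset, v);
        }
        long n = 0;
        long result = 0;
        for (long v : values) {              // pass 2: XOR re-based values, count them
            result ^= (v - offset);
            n++;
        }
        n++;                                 // account for the missing value
        // Pad with n, n+1, ... until n reaches a power of 2, so that the
        // padded range 0..n-1 XORs to 0 except for the missing value.
        while (n == 1 || (n & (n - 1)) != 0) {
            result ^= n;
            n++;
        }
        return result + offset;
    }

    public static void main(String[] args) {
        System.out.println(findMissing(new long[]{3, 4, 6, 7, 8})); // prints 5
    }
}
```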
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
if (supplied != last + 1) {
return last + 1;
}
last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
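The same scan as a small Java sketch, assuming the sorted values are already in memory as a non-empty ascending long[] (the names are illustrative only). One deliberate deviation from the pseudocode: it compares with > instead of !=, so that duplicate values in the input do not produce a false gap. It does not guard against last + 1 overflowing.

```java
class SortedGapScan {
    // Assumes `sorted` is non-empty and in ascending order (duplicates allowed).
    static long firstGap(long[] sorted) {
        long last = sorted[0];
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] > last + 1) {
                return last + 1;             // found a hole right after `last`
            }
            last = sorted[i];
        }
        return last + 1;                     // contiguous: fall back to max + 1
    }

    public static void main(String[] args) {
        System.out.println(firstGap(new long[]{1, 2, 3, 5})); // prints 4
    }
}
```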
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
ulong bitcount = filesize * 8; //number of bits in file
for (ulong i = 0; i < bitcount; i++)
{
Console.Write(9);
}
}
Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^(bitcount - (4 * 10^9 - 1)), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
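The arithmetic can be spot-checked for small file sizes with java.math.BigInteger. This is only a sanity check under the assumption that the file stores numbers in plain binary; the class and method names are made up.

```java
import java.math.BigInteger;

class AllNines {
    // The decimal number made of `bitcount` nines, i.e. 10^bitcount - 1.
    static BigInteger nines(int bitcount) {
        return BigInteger.TEN.pow(bitcount).subtract(BigInteger.ONE);
    }

    // True if the all-nines number is larger than 2^bitcount, the number of
    // distinct values a file of that many bits could encode.
    static boolean exceedsFileCapacity(int fileSizeBytes) {
        int bitcount = fileSizeBytes * 8;
        BigInteger capacity = BigInteger.ONE.shiftLeft(bitcount); // 2^bitcount
        return nines(bitcount).compareTo(capacity) > 0;
    }

    public static void main(String[] args) {
        System.out.println(exceedsFileCapacity(1)); // prints true
    }
}
```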
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
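A rough Java sketch of the 2-pass counting-bucket idea for 32-bit integers follows. It assumes the values are distinct and that there are fewer than 2^32 of them; an int[] stands in for the file and the names are hypothetical. The two arrays take about 320 KB together.

```java
class TwoPassBuckets {
    // Assumes distinct 32-bit values, fewer than 2^32 of them.
    static int findAbsent(int[] values) {
        // Pass 1: count values per bucket, keyed by the upper 16 bits.
        int[] counts = new int[1 << 16];
        for (int v : values) {
            counts[v >>> 16]++;
        }
        int bucket = -1;
        for (int b = 0; b < counts.length; b++) {
            if (counts[b] < (1 << 16)) {     // this bucket must have a hole
                bucket = b;
                break;
            }
        }
        if (bucket < 0) {
            throw new IllegalStateException("every bucket is full");
        }
        // Pass 2: mark which lower-16-bit patterns occur inside that bucket.
        boolean[] seen = new boolean[1 << 16];
        for (int v : values) {
            if ((v >>> 16) == bucket) {
                seen[v & 0xFFFF] = true;
            }
        }
        for (int low = 0; low < seen.length; low++) {
            if (!seen[low]) {
                return (bucket << 16) | low; // reassemble the absent value
            }
        }
        throw new IllegalStateException("no hole found in the chosen bucket");
    }
}
```

For example, given the integers 0 through 99 with 42 left out, bucket 0 holds only 99 values, and the second pass finds 42 as the first unmarked slot.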
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.
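A small Java sketch of the diagonal construction, using BigInteger for the arbitrarily large integers. It builds the whole result in memory instead of streaming it one bit at a time, and it assumes non-negative inputs; the class and method names are made up for illustration.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

class Diagonal {
    // Returns a number whose i-th high-order bit (in a field as wide as the
    // list is long) is the complement of the same bit in numbers.get(i).
    static BigInteger differFromAll(List<BigInteger> numbers) {
        int width = numbers.size();
        BigInteger out = BigInteger.ZERO;
        for (int i = 0; i < width; i++) {
            int pos = width - 1 - i;         // i-th bit from the high-order end
            if (!numbers.get(i).testBit(pos)) {
                out = out.setBit(pos);       // input bit is 0, so output 1
            }                                // input bit is 1: leave output at 0
        }
        return out;
    }

    public static void main(String[] args) {
        List<BigInteger> numbers = Arrays.asList(
                BigInteger.valueOf(0), BigInteger.valueOf(1),
                BigInteger.valueOf(2), BigInteger.valueOf(3));
        System.out.println(differFromAll(numbers)); // prints 12
    }
}
```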
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bits, the flags (2^32 bits, i.e. 512 MB) will conveniently fit in 1 GB of RAM.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false else (by comparing that certain integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
if (isNotInFile(i)) {
return i;
}
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leaves will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion numbers is roughly 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible: if two end leaves have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is no need to build the tree completely either. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might in fact work.
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
foreach (var number in file) {
bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}
for (var i = 0; i < 4294967296; i++) {
if (bitArray[i] == 0) {
return i - 2147483648;
}
}
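In Java the pseudo-code cannot be written with a single java.util.BitSet, because its indices are ints and 2^32 flags do not fit. One possible sketch packs the flags into a long[] instead. The bits parameter is an addition for illustration, so the idea can be tried on small widths; bits = 32 corresponds to the 0.5 GB bitmap above. Names are hypothetical.

```java
class BitmapScan {
    // `bits` is the integer width (1..32). Values are assumed to be signed
    // integers in [-2^(bits-1), 2^(bits-1) - 1].
    static long findAbsent(long[] values, int bits) {
        long size = 1L << bits;              // number of representable values
        long shift = 1L << (bits - 1);       // moves the minimum to index 0
        long[] bitmap = new long[(int) Math.max(1, size >>> 6)]; // 64 flags per long
        for (long v : values) {
            long idx = v + shift;
            bitmap[(int) (idx >>> 6)] |= 1L << (idx & 63);
        }
        for (long idx = 0; idx < size; idx++) {
            if ((bitmap[(int) (idx >>> 6)] & (1L << (idx & 63))) == 0) {
                return idx - shift;          // first flag still at 0
            }
        }
        throw new IllegalStateException("every value is present");
    }
}
```

With bits = 8 and every value from -128 to 127 except 5 supplied, the scan stops at index 133 and returns 5.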
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits; however, this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
val = val & ~fileVal;
if (val == 0) error;
}
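In Java this could be sketched as below, with a long[] standing in for the file and made-up names. If any bit survives, the surviving value has a bit set that no value in the file has, so it cannot be in the file.

```java
class BitElimination {
    // Returns a value that is not in `values`, or throws if every bit
    // position is set in at least one value.
    static long eliminate(long[] values) {
        long val = -1L;                      // all 64 bits set
        for (long v : values) {
            val &= ~v;                       // clear every bit that v has set
            if (val == 0) {
                throw new IllegalStateException("every bit appears in some value");
            }
        }
        return val;
    }

    public static void main(String[] args) {
        System.out.println(eliminate(new long[]{1, 2, 4})); // prints -8
    }
}
```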
Bit Counts
Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.
Range Logic
Keep track of a list of ordered ranges (ordered by start). A range is defined by the structure:
struct Range
{
long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
2^(128*10^18) + 1 (which is (2^8)^(16*10^18) + 1): can't it be a universal answer for today? This represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
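Here is a sketch of the running-total trick in Java. The bits parameter is an assumption added so the arithmetic can be tried on a small width (bits = 32 is the case described above), values are treated as unsigned, and the names are hypothetical. It follows the steps literally: wrap the sum, add half the modulus, subtract from the modulus.

```java
class WrappingSum {
    // `values` holds every unsigned integer of the given width (1..62 bits)
    // exactly once, except for one missing value.
    static long findMissing(long[] values, int bits) {
        long modulus = 1L << bits;           // 4294967296 when bits = 32
        long mask = modulus - 1;
        long total = 0;
        for (long v : values) {
            total = (total + v) & mask;      // addition with wrap-around
        }
        total = (total + modulus / 2) & mask; // add half the modulus, wrapping
        return (modulus - total) & mask;     // subtract from the modulus
    }
}
```

This works because the sum of all values 0..2^bits - 1 is congruent to 2^(bits-1) modulo 2^bits. With bits = 4 and the values 0..15 except 11, the wrapped total is 13, adding 8 gives 5, and 16 - 5 = 11.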
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print four '2' digits for every byte of the file.
for (c=0; c<sz; c++) {
for (b=0; b<4; b++) {
print "2";
}
}
As Ryan basically said, sort the file and then go over the integers; when a value is skipped, there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) = 4611686018.4 (roughly 1 in 4 billion). You'd be right most of the time!
(Joking... kind of.)

implementing a patricia trie in java

I'm trying to rewrite a c++ patricia trie in java.
The c++ code is from here
full source code
I'm a bit stuck.
So here's my understanding:
#define ZEROTAB_SIZE 256
head->key = (char*)calloc(ZEROTAB_SIZE, 1);
we create an array of 256 bits for the key, so we can have a string with a maximum length of 32 characters and every character is represented with 8 bits. Can i implement this with a char array in java?
template <class T>
int PatriciaTrie<T>::bit_get(PatriciaTrieKey bit_stream, int n) {
if (n < 0) return 2; // "pseudo-bit" with a value of 2.
int k = (n & 0x7);
return ( (*(bit_stream + (n >> 3))) >> k) & 0x1;
}
k gets the last 7 bits of n, we move to the n/8 character of the string (not exactly n/8, since shifting to the right would drop anything lower than 8 to zero), then we shift the value of bit_stream[n>>3] by k, and then we get the last bit. If I use arrays in Java, could I rewrite this as
return (bit_stream[n>>3] >> k) & 0x1;
?
template <class T>
int PatriciaTrie<T>::bit_first_different(PatriciaTrieKey k1, PatriciaTrieKey k2) {
if (!k1 || !k2)
return 0; // First bit is different!
int n = 0;
int d = 0;
while ( (k1[n] == k2[n]) &&
(k1[n] != 0) &&
(k2[n] != 0) )
n++;
while (bit_get(&k1[n], d) == bit_get(&k2[n], d))
d++;
return ((n << 3) + d);
}
Now this is where it gets confusing. The first part, up to the second while loop, looks clear enough: loop and check how many bits are equal and non-zero. But I'm not sure what the second loop is doing: we take the address of the two keys and check whether the first bits are equal, and if they are we check again until we find unequal bits?
Mainly I'm not sure how the address of the key is used here, but I might be confused about the bit shifting in the bit_get function too.
I want to do a comparison between the tries in C++ and Java for my Java class, and I want to keep the implementations as similar as possible.
I'm not familiar with this data structure, but there are some problems with your understanding of this code.
First, calloc allocates 256 bytes, not bits. new byte[256] would be comparable in Java.
Second, n & 0x7 gets three bits of n, not seven. A clearer way to write this would be n/8 and n%8 instead of n>>3 and n & 7, but the bitwise operations might be slightly faster if your compiler is stupid.
You are correct that (bit_stream[n>>3]>>k) & 1 is the same.
Now, the first loop in bit_first_different loops over bytes, not bits. The check for 0 is to prevent running off the end of the keys. Once that loop terminates, n refers to the first differing byte. The second loop is then looking for which bit is different.
Note that if the two keys are not different, then the second loop may run off the end of the keys, potentially causing a segmentation fault.
Now, the & is taking the address of k1[n] because the bit_get function is expecting a pointer to a character; this passes in a pointer to the nth element of the bit stream. After the loop, d is the offset of the first differing bit within k1[n] and k2[n].
Finally, the code combines n (which byte?) with d (which bit in that byte?) to give the bit position. Again I would advocate 8*n+d for clarity, but that's a matter of taste.
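Putting these points together, a Java translation might look like the sketch below. It assumes keys are stored as zero-terminated byte arrays, mirroring the C strings, and the names are illustrative. Since Java has no pointer arithmetic, &k1[n] becomes an offset of 8 * n bits into the same array. As in the C++ original, two identical keys make the second loop run past the end, which in Java shows up as an ArrayIndexOutOfBoundsException rather than a segmentation fault.

```java
class PatriciaBits {
    // Bit n of the key, where bit 0 is the least significant bit of key[0].
    // Returns the "pseudo-bit" 2 for negative n, as in the C++ code.
    static int bitGet(byte[] key, int n) {
        if (n < 0) return 2;
        int k = n & 0x7;                     // bit index within the byte (n % 8)
        return (key[n >> 3] >> k) & 0x1;     // byte index is n / 8
    }

    // Assumes both keys end with a 0 byte and differ somewhere.
    static int bitFirstDifferent(byte[] k1, byte[] k2) {
        if (k1 == null || k2 == null) return 0;
        int n = 0;
        while (k1[n] == k2[n] && k1[n] != 0 && k2[n] != 0) {
            n++;                             // skip whole bytes that match
        }
        int d = 0;
        while (bitGet(k1, 8 * n + d) == bitGet(k2, 8 * n + d)) {
            d++;                             // find the differing bit in byte n
        }
        return (n << 3) + d;
    }

    public static void main(String[] args) {
        byte[] ab = {'a', 'b', 0};
        byte[] ac = {'a', 'c', 0};
        System.out.println(bitFirstDifferent(ab, ac)); // prints 8
    }
}
```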
Can i implement this with a char array in java?
My Java is a bit rusty, so please double-check, but char in Java is an unsigned 16-bit type, while byte is a signed 8-bit type, so byte[] is the closer match for the C++ char array. Either way, the element is promoted to int before >> is applied. For a negative byte that means sign extension, but since k is at most 7 and the result is masked with & 0x1, the extracted bit is still correct. You only need an & 0xFF mask if you want the whole byte as an unsigned value.
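This can be checked against the language rules directly: in Java, byte is a signed 8-bit type, char is an unsigned 16-bit type, and both are widened to int before a shift. A tiny sketch (helper names are made up) shows that masking with & 0x1 still yields the right bit for a negative byte, and that & 0xFF recovers the unsigned value.

```java
class SignednessCheck {
    // Extracts bit k (0..7) of a byte; sign extension only affects bits 8 and up.
    static int bitOf(byte b, int k) {
        return (b >> k) & 0x1;
    }

    // Reads a byte as an unsigned value in 0..255.
    static int unsignedValue(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;                 // -128 as a signed byte
        System.out.println(b >> 7);           // prints -1 (sign-extended)
        System.out.println(bitOf(b, 7));      // prints 1
        System.out.println(unsignedValue(b)); // prints 128
    }
}
```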
return (bit_stream[n>>3] >> k) & 0x1;
In C or C++, *(array + k) is just another way to write array[k], so your translation looks correct. As for the interpretation, bit_stream[n>>3] fetches the byte in which the desired bit is located, and >> k moves the desired bit into the least-significant bit position. Finally, we remove all the bits we're not interested in by masking them out with & 0x1. This leaves us with a value of either 0 or 1, depending on whether the bit was set or not.
What the final function does is compare 2 bit strings and return the bit position where the 2 strings first differ. The first loop is essentially an optimized version of the second loop: instead of doing a bit-by-bit comparison, it checks whole bytes.
In other words, it first loops over the bytes and finds the first 2 that differ. It then takes those 2 differing bytes and loops over them until it finds the first 2 bits that differ. Note that the bit_get function is never going to receive an n greater than 7 in this scenario, because we know there's a difference somewhere in the byte. The final bit position is then constructed from the result of both loops like so: (number_of_equal_bytes * 8) + number_of_equal_bits.
