Is this an off-by-one bug in Java 7?

I don't know where to seek clarifications and confirmations on Java API documentation and Java code, so I'm doing it here.
In the API documentation for FileChannel, I'm finding off-by-one errors with respect to file position and file size in more places than one.
Here's just one example. The API documentation for transferFrom(...) states:
"If the given position is greater than the file's current size then no bytes are transferred."
I confirmed that the OpenJDK code too contains this code...
public long transferFrom(ReadableByteChannel src, long position, long count)
    throws IOException
{
    // ...
    if (position > size())
        return 0;
    // ...
}
... in file FileChannelImpl.java consistent with the documentation.
Now, while the above code snippet and the API documentation appear mutually consistent, I 'feel' that the check should be 'greater than or equal to' and not merely 'greater than': position being a 0-based index into the file's data, reading at position == size() will have no data to return to the caller. (At position == size() - 1, at least 1 byte -- the last byte of the file -- could be returned to the caller.)
Here are some other similar instances in the same API documentation page:
position(...): "Setting the position to a value that is greater than the file's current size is legal but does not change the size of the file." (Should have been 'greater than or equal to'.)
transferTo(...): " If the given position is greater than the file's current size then no bytes are transferred." (Should have been 'greater than or equal to'.)
read(...): "If the given position is greater than the file's current size then no bytes are read." (Should have been 'greater than or equal to'.)
Lastly, the documentation section for the return value of read(...) fails to remain even self-consistent with the rest of the documentation. Here's what it states:
read(...)
Returns:
The number of bytes read, possibly zero, or -1 if the given position is greater than or equal to the file's current size
So, in this lone instance, I do see them mention the right thing.
Overall, I don't know what to make of all of this. If I write my code today to match this documentation, then a future bug fix in Java (code or documentation) will render my code buggy, requiring a fix from my side as well. If instead I do the right thing today, then my code is buggy to start with, as things stand!
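The boundary behaviour of read(...) is easy to check empirically. The sketch below (class and method names are mine) writes a 3-byte temp file and performs an absolute read at position == size(); per the Returns: clause quoted above, this yields -1:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionAtSizeDemo {
    // Returns the result of an absolute read at position == size().
    static int readAtSize() throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3});
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // position == size(): no byte exists at this 0-based index
            return ch.read(ByteBuffer.allocate(1), ch.size());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAtSize()); // -1, per the Returns: clause
    }
}
```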

This could be a bit clearer in the javadoc and is now tracked here:
https://bugs.openjdk.java.net/browse/JDK-8029370
Note that clarifying the javadoc shouldn't change anything for implementations of FileChannel (for example, the transfer methods return the number of bytes transferred so this is 0 when the position is at size or beyond size).

It is not an off-by-one bug, in that there is no behavior problem, right? At best it's a documentation problem, and the docs aren't wrong here, just perhaps not complete.
However, I am not sure they are missing anything. position == size() is not an exceptional situation: it is a situation where 0 bytes can be read, more or less by definition. position > size() is exceptional: fewer than 0 bytes can be read, so it needs a note. Does it throw an exception, or do nothing? read() is different, since it has to return at least one byte, so having 0 bytes to read is the exceptional condition there too.
I personally think the fact that you're asking means the docs could be more explicit. It is also not clear why that method doesn't short-circuit rather than try to transfer 0 bytes, but there does look to be a possible logic behind it.
(Or do you mean it does something not documented?)

Related

BitSet.size() returns negative value. Known bug?

new BitSet(Integer.MAX_VALUE).size() reports a negative value:
import java.util.BitSet;

public class NegativeBitSetSize {
    public static void main(String[] args) {
        BitSet a;
        a = new BitSet(Integer.MAX_VALUE);
        System.out.println(a.size()); // -2147483648
        a = new BitSet(Integer.MAX_VALUE - 50);
        System.out.println(a.size()); // -2147483648
        a = new BitSet(Integer.MAX_VALUE - 62);
        System.out.println(a.size()); // -2147483648
        a = new BitSet(Integer.MAX_VALUE - 63);
        System.out.println(a.size()); // 2147483584
    }
}
On the test system:
$ java -version
openjdk version "11.0.14" 2022-01-18
OpenJDK Runtime Environment (build 11.0.14+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.14+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)
I couldn't find a bug report for this. Is this known or documented?
I doubt this would be documented. It certainly won't be 'fixed', as there is no sensible fix available that doesn't break backwards compatibility, and it is nowhere near relevant enough to take such drastic steps.
Digging under the hood - why is this happening?
Whilst the API docs make no such guarantee, the effect of size() is that it simply returns the nBits value you passed when you constructed the BitSet instance... but rounded up to the next value that is evenly divisible by 64:
System.out.println(new BitSet(1).size());   // 64
System.out.println(new BitSet(63).size());  // 64
System.out.println(new BitSet(64).size());  // 64
System.out.println(new BitSet(65).size());  // 128
System.out.println(new BitSet(100).size()); // 128
System.out.println(new BitSet(128).size()); // 128
System.out.println(new BitSet(129).size()); // 192
This is logical; the implementation uses an array of long values to store these bits, as that's (by a factor of 8!) more efficient than using e.g. a boolean[]: each boolean takes up a byte in an array, and an entire long's worth of bits as a lone variable.
The spec doesn't guarantee this, but it explains why this is happening.
It then also explains what you are witnessing: Integer.MAX_VALUE is 2147483647. Round that up to the nearest multiple of 64 and you get 2147483648, which overflows int; Integer.MAX_VALUE + 1 and (int) 2147483648L are both the same value: -2147483648. That is the one value that exists in signed int space as a negative number with no matching positive number. (That makes sense too: some bit sequence needs to represent 0, which is neither positive nor negative. By the rules of 2's complement, which is how Java represents all integral numbers in bit form, 0 sits in the 'positive' space, given that it's all 0 bits. It thus 'leaches' a number from there, and that number is 2147483648.)
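The arithmetic can be sketched in isolation. The sizeFor method below is my paraphrase of the implementation's word-count logic (one long word per 64 bits, rounded up), not actual JDK code, and assumes nBits >= 1:

```java
public class BitSetSizeOverflow {
    // Mirrors BitSet's sizing: one long word per 64 bits, rounded up.
    static int sizeFor(int nBits) {
        int words = ((nBits - 1) >> 6) + 1; // i.e. wordIndex(nBits - 1) + 1
        return words * 64;                  // overflows int for the largest 63 inputs
    }

    public static void main(String[] args) {
        System.out.println(sizeFor(Integer.MAX_VALUE));      // -2147483648
        System.out.println(sizeFor(Integer.MAX_VALUE - 63)); // 2147483584
        System.out.println(sizeFor(1));                      // 64
    }
}
```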
Let's fix it!
One easy fix is to have the size() method return a long instead, which can trivially represent 2147483648, the correct answer. Unfortunately, this is not backwards compatible, and hence extremely unlikely to succeed if one were to ask for that change.
Another fix is to add a second method with some throw-in-the-towel name such as accurateSize() or whatnot, which does return a long, so that size() remains unmolested and backwards compatibility is preserved. But this dirties up the API forever, for a detail that is irrelevant in all cases except the largest 63 numbers you can ask for. (Integer.MAX_VALUE - 62 through Integer.MAX_VALUE are the only values of nBits for which size() returns a negative value, and the value returned is always Integer.MIN_VALUE.) I doubt they'd do that.
A third fix is to lie and return Integer.MAX_VALUE instead, which isn't quite the right value (as 1 bit more is in fact 'available' in the bit space). Then again, you can't actually set that last bit: you can't pass 2147483648 to the constructor, since it must be an int, and if you try you end up with -2147483648, which is negative and causes the constructor to throw, hence not giving you an instance. Short of hackery such as using reflection to set private fields, which APIs do not need to address, you can't make a BitSet that can actually store the value of the 2147483648th bit.
This then gets us to what the point of size() is. Is it for telling you the number of bytes that the BitSet object occupies? If that's the point, it's never been a great way to go about it: the JVM doesn't guarantee that a long[]'s memory size is arrSize * 8 bytes (though all JVM implementations do that, plus some low overhead for the array's header structure).
Instead it is perhaps simply to let you know what you can do with it. Even if you call, say, new BitSet(5), you can still set the 6th bit (because why not - it doesn't "cost" anything, I guess that was the intent). You can set all bits from 0 up to the .size() minus 1.
And this gets us to the real answer!
size() is not actually broken. The number returned is entirely correct: That is, in fact, the size. It's merely that when you print it, it 'prints wrong' - because size()'s return value should be interpreted as unsigned. The javadoc of size() explicitly calls out its only point, which is to take that number, and subtract 1: This then tells you the maximum element you can set.
And this works just fine:
BitSet x = new BitSet(Integer.MAX_VALUE);
int maxIndex = x.size() - 1;
System.out.println(maxIndex);
x.set(maxIndex);
The above code works fine. That maxIndex value is 2147483647 (which is Integer.MAX_VALUE), as expected.
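If you do want the true capacity as a number, interpreting the result as unsigned works without any JDK changes; Integer.toUnsignedLong has existed since Java 8. The sketch below uses the -2147483648 value reported in the question directly, rather than allocating the quarter-gigabyte BitSet:

```java
public class UnsignedSizeDemo {
    public static void main(String[] args) {
        // size() result the question reports for new BitSet(Integer.MAX_VALUE)
        int reported = -2147483648;
        long unsigned = Integer.toUnsignedLong(reported);
        System.out.println(unsigned);     // 2147483648: the actual capacity in bits
        System.out.println(reported - 1); // 2147483647: still the valid max index
    }
}
```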
Hence, there's really nothing to do here: the API is fine as is and accurately does what it suggests you use it for. Any API you care to come up with that's 'better' would be backwards incompatible; changing BitSet is not a good idea, and adding more methods, java.util.Vector-style, uglies up the API, which is definitely a case of the cure being worse than the disease.
That just leaves adding notes to the docs. If you delve into this level of exotics in docs, you end up with huge documentation that is, again, a cure worse than the disease. The sustainable solution would perhaps be for javadoc to gain a fundamental ability to write esoteric footnotes, which e.g. the javadoc tool can turn into HTML by way of a 'folding' popdown interface element that is folded up by default (i.e. the exotic footnotes are not visible), but can be expanded if you really want to read the details.
Javadoc doesn't have this.
CONCLUSION: One can easily argue the API isn't broken at all; nothing in size() explicitly says that the returned value should be interpreted as a signed int; the only explicit promise is that you can subtract 1 from the result and use that as index, which works fine. At best, you could file a bug report to get the docs updated, except that's not a good idea because it's not (easily) possible to add such esoterics to the documentation. If you do want to go down that path, there's a lot more of this sort of thing in the JDK libraries that aren't documented either.

How do I search a sorted, huge, direct buffer efficiently in Java?

I have a direct buffer holding Integers that are already sorted (i.e. 1,1,3,3,3,3,7,7,....). Most values will occur multiple times. I want to find the first position of values I search for.
Is there search functionality built into Java that works directly on buffers? (I couldn't find anything.)
If not, is there any decent library providing such functionality?
If not, what search algorithm would you recommend implementing, given that:
I will typically have millions of entries in my buffer
Speed is very important
It must return the first occurrence of the searched number
I'd rather not have it modify the data as I will need the original data afterwards
EDIT: Thanks to all the posters suggesting Arrays.binarySearch(), but, as far as I know, direct buffers do not generally have a backing array. That's why I was looking for an implementation that directly works on the buffer.
Also, each value can occur up to a thousand times, therefore a linear search after finding a landing point might not be very efficient. The comparator suggestion of dasblinkenlight might work though.
The best approach would be to code your own implementation of Binary Search for the buffers. This approach carefully avoids potential performance hits associated with creating views, copying large arrays etc., and stays compact at the same time.
The code sample at the link returns the rightmost match; you need to replace > with >= on the nums[guess] > comparison line to get the leftmost one. This saves you a potentially costly backward linear search, or the use of a "backward" Comparator, which would require wrapping your ints into Integer objects.
Use the binary search algorithm:
ByteBuffer buffer = createByteBuffer();
IntBuffer intBuffer = buffer.asIntBuffer();
If the buffer can be converted to an int array, use:
int[] array = intBuffer.array();
int index = java.util.Arrays.binarySearch(array, 7);
I don't know about a built-in functionality for buffers (Arrays.binarySearch(...) would require you to convert the buffer to an array) but as for 3.: since the buffer is already sorted a binary search might be useful. If you found the value you could then check the previous values to get the start of that sequence.
You'll probably have to write your own binary search: one that always moves to the left if the value checked is equal to the one searched.
So effectively, instead of x, you're going to search for x - ε. Your algorithm will always take exactly log n (or log n + 1) steps, as it will always "fail", but it will give you the index of the first element that is bigger than x - ε. All you need to do is check whether that element is x: if it is, you've found your match; if it isn't, there's no x in your buffer.
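A minimal sketch of such a leftmost binary search, written against IntBuffer's absolute get so it works on direct buffers with no backing array (class and method names are mine):

```java
import java.nio.IntBuffer;

public class BufferSearch {
    // Leftmost binary search on a sorted IntBuffer; no backing array required.
    static int firstIndexOf(IntBuffer buf, int key) {
        int lo = 0, hi = buf.limit(); // search the half-open range [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow
            if (buf.get(mid) >= key)   // on equality, keep moving left
                hi = mid;
            else
                lo = mid + 1;
        }
        // lo is now the first position holding a value >= key
        return (lo < buf.limit() && buf.get(lo) == key) ? lo : -1;
    }

    public static void main(String[] args) {
        IntBuffer buf = IntBuffer.wrap(new int[] {1, 1, 3, 3, 3, 3, 7, 7});
        System.out.println(firstIndexOf(buf, 3)); // 2
        System.out.println(firstIndexOf(buf, 7)); // 6
        System.out.println(firstIndexOf(buf, 5)); // -1
    }
}
```

The absolute get(int) never moves the buffer's position, so the original data and buffer state are left untouched.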

RLE sequence, setting a value

Say I have an arbitrary RLE Sequence. (For those who don't know, RLE compresses an array like [4 4 4 4 4 6 6 1 1] into [(5,4) (2,6) (2,1)]. First comes the number of a particular integer in a run, then the number itself.)
How can I determine an algorithm to set a value at a given index without decompressing the whole thing? For example, if you do set(0,1) the RLE would become [(1,1) (4,4) (2,6) (2,1)]. (In set, the first value is the index, the second is the value)
Also, I've divided this compressed sequence into an ArrayList of Entries. That is, each Entry is one of these: (1,1) where it has an amount and value.
I'm trying to find an efficient way to do this, right now I can just think of methods that have WAY too many if statements to be considered clean. There are so many possible variations: for example, if the given value splits an existing entry, or if it has the same value as an existing entry, etc...
Any help would be much appreciated. I'm working on an algorithm now, here is some of it:
while (i < rleAL.size() && count != index)
{
    indexToStop = 0;
    while (count < index || indexToStop == rleAL.get(i).getAmount())
    {
        count++;
        indexToStop++;
    }
    if (count != index)
    {
        i++;
    }
}
As you can see this is getting increasingly sloppy...
Thanks!
RLE is generally bad at updates, exactly for the reason stated. Changing ArrayList to LinkedList won't help much, as LinkedList is awfully slow in everything but inserts (and even with inserts you must already hold a reference to a specific location using e.g. ListIterator).
Talking about the original question, though, there's no need to decompress it all. All you need is to find the right place (summing up the counts as you go), which is linear time (consider a skip list to make it faster), after which you'll have just four cases:
You're inside a block, and the block's value is the same as the number you're trying to save.
You're inside a block, and the number is different.
You're at the beginning or the end of a block, and the number differs from the block's value but is the same as the neighbour's.
You're at the beginning or the end of a block, and the number is neither the block's value nor the neighbour's.
The corresponding actions are, obviously:
Do nothing
Change counter in the block, add two blocks
Change counters in two blocks
Change counter in one block, insert a new one
(Note though if you have skip lists, you must update those as well.)
Update: it gets more interesting if the block to update is of length 1, true. Still, it all stays just as trivial: in any case, the changes are limited to a maximum of three blocks.
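The case analysis above can be collapsed into one uniform routine: split the containing run into up to three pieces around the index, then merge equal neighbours, which handles the length-1 run and the "same as neighbour" cases automatically. A sketch, using a hypothetical Run class in place of the asker's Entry:

```java
import java.util.ArrayList;
import java.util.List;

public class RleSet {
    // Hypothetical stand-in for the asker's Entry: 'count' copies of 'value'.
    static final class Run {
        int count, value;
        Run(int count, int value) { this.count = count; this.value = value; }
        @Override public String toString() { return "(" + count + "," + value + ")"; }
    }

    // Sets element 'index' to 'value' without decompressing: split the
    // containing run into up to three pieces, then merge equal neighbours.
    static void set(List<Run> runs, int index, int value) {
        int i = 0, start = 0;               // start = first index covered by runs.get(i)
        while (start + runs.get(i).count <= index) {
            start += runs.get(i).count;
            i++;
        }
        Run r = runs.get(i);
        if (r.value == value) return;       // same value: nothing to do
        int before = index - start;         // elements of r to the left of index
        int after = r.count - before - 1;   // elements of r to the right of index
        runs.remove(i);
        if (after > 0) runs.add(i, new Run(after, r.value));
        runs.add(i, new Run(1, value));
        if (before > 0) runs.add(i, new Run(before, r.value));
        int k = i + (before > 0 ? 1 : 0);   // index of the new length-1 run
        if (k + 1 < runs.size() && runs.get(k).value == runs.get(k + 1).value) {
            runs.get(k).count += runs.get(k + 1).count; // absorb right neighbour
            runs.remove(k + 1);
        }
        if (k > 0 && runs.get(k - 1).value == runs.get(k).value) {
            runs.get(k - 1).count += runs.get(k).count; // absorb left neighbour
            runs.remove(k);
        }
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run(5, 4));
        runs.add(new Run(2, 6));
        runs.add(new Run(2, 1));
        set(runs, 0, 1);          // the example from the question
        System.out.println(runs); // [(1,1), (4,4), (2,6), (2,1)]
    }
}
```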

How does java handle integer overflow and underflow?

I know this is an old question, asked many times, but I am not able to find any satisfactory answer, hence asking again.
Can someone explain what exactly happens in case of integer overflow and underflow?
I have heard about some 'lower order bytes' which handle this; can someone explain what that is?
Thanks!
Imagine you have only 2 bit positions and are counting up (adding 1 each time):
00
01
10
11
100
But the last one gets cut down to "00" again. So there is your "overflow": you're back at 00. Now, depending on what the bits mean, this can mean several things, but most of the time it means you are going from the highest value to the lowest (11 to 00).
Mark Peters adds a good point in the comments: even without overflow you'll have a problem, because the first bit is used as the sign bit, so you'll go from high to low without losing that bit. You could say that bit is 'separate' from the others.
Java wraps the number around to the minimum or maximum integer (depending on whether it is overflow or underflow).
So:
System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE);
System.out.println(Integer.MIN_VALUE - 1 == Integer.MAX_VALUE);
prints true twice.
It basically handles them without reporting an exception, performing the 2's complement arithmetic without concern for overflow or underflow, returning the expected (but incorrect) result based on the mechanics of 2's complement arithmetic.
This means that the bits which overflow or underflow are simply chopped, and that Integer.MIN_VALUE - 1 returns Integer.MAX_VALUE.
As far as "lower order bytes" being a workaround, they really aren't. What happens when you use Java bytes to do arithmetic is that they get widened into ints, the arithmetic is performed on the ints, and the end result is likely to be completely contained in the returned int, as it has far more storage capacity than the starting bytes.
Another way to think of how Java handles overflow/underflow is to picture an analog clock. You can move it forward an hour at a time, but eventually the hours will start over again. You can wind the clock backward, but once you go back past the start you are at the end again.
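The points above are easy to verify: int arithmetic wraps silently, and byte operands are widened to int before arithmetic, so the wraparound only reappears when you narrow the result back down:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // int arithmetic wraps silently in 2's complement
        System.out.println(Integer.MAX_VALUE + 1); // -2147483648

        // byte operands are widened to int before arithmetic...
        byte b = 127;
        int widened = b + 1;
        System.out.println(widened);               // 128: no wrap in int

        // ...so wraparound only happens when narrowing back to byte
        byte narrowed = (byte) (b + 1);
        System.out.println(narrowed);              // -128: high bits chopped
    }
}
```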

changing the index positioning in InputStream

I have a binary file which contains keys, and after every key there is an image associated with it. I want to jump between different keys but could not find any method which changes the index position in the input stream. I have seen the mark() method, but it does not jump to different places.
Does anybody have any idea how to do that?
There's a long skip(long n) method that you may be able to use:
Skips over and discards n bytes of data from this input stream. The skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0. This may result from any of a number of conditions; reaching end of file before n bytes have been skipped is only one possibility. The actual number of bytes skipped is returned. If n is negative, no bytes are skipped.
As documented, you're not guaranteed that n bytes will be skipped, so always double-check the returned value. Note that this does not allow you to "skip backward", but if markSupported() is true, you can reset() first and then skip forward to an earlier position if you must.
Other options
You may also use java.io.RandomAccessFile, which as the name implies, permits random access with its seek(long pos) method.
You mentioned images, so if you are using Java Advanced Imaging, another possible option is com.sun.media.jai.codec.FileSeekableStream, which is a SeekableStream that takes its input from a File or RandomAccessFile. Note that this class is not a committed part of the JAI API. It may be removed or changed in future releases of JAI.
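A minimal sketch of the RandomAccessFile approach (the file contents and names here are made up for illustration; in your case the offset would be wherever the next key starts):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekDemo {
    // Reads one byte at an absolute offset, jumping forward or backward freely.
    static int readAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset); // absolute position, unlike InputStream.skip
            return raf.read();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("keys", ".bin");
        Files.write(tmp, new byte[] {10, 20, 30, 40});
        System.out.println(readAt(tmp, 2)); // 30
        System.out.println(readAt(tmp, 0)); // 10: backward jumps are fine too
        Files.deleteIfExists(tmp);
    }
}
```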
