I found some questions on SO about checking operations before execution for over/underflow behavior. It seems that there are ways to do this quite easily. So why isn't there an option to automatically check each arithmetic operation before execution, and why is there no exception for arithmetic over/underflow? Or, phrased differently: in what scenario would it be useful to allow operations to overflow unnoticed?
Is it maybe a matter of run-time cost? Or does most overflow occur during non-mathematical operations?
Actually, for C there are checking options; see http://danluu.com/integer-overflow/
As for Java, adding integer overflow checks would open a can of worms. As Java does not offer unsigned types, unsigned math is often done in plain int or long types - obviously the VM will not be magically aware of the unsigned nature of the operation intended, meaning you would either need to add unsigned types or the programmer would need to pay a lot of attention to turn the checks on/off. An example of unsigned math with signed types can be found in Arrays.binarySearch. On a side note, Java does exactly define what the result is in case of overflow, so relying on overflow behavior is legal use of defined behavior.
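For illustration, here is a hedged sketch (not the actual JDK source) of that kind of idiom: the classic binary-search midpoint computation deliberately lets the signed addition wrap and then repairs it with an unsigned shift.

static int midpoint(int low, int high) {
    // low + high may wrap into a negative int, but since both operands are
    // non-negative the true sum still fits in 32 unsigned bits, so the
    // unsigned shift >>> yields the correct average anyway.
    return (low + high) >>> 1;
}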
As briefly analyzed in the C link above, these checks can have a severe impact on performance in practice, due to crude implementations and/or interference with other code optimizations.
Also, while most CPUs can detect overflow (usually via the C and V flags), they do it simultaneously for signed and unsigned (common CPU ISAs do not make a distinction between signed and unsigned operations in the case of add/sub). It's up to the program to respond to these flags, which means inserting additional instructions into the code. Again this means the programmer/compiler has to be aware whether the operation is intended to be signed or unsigned to make the correct choice.
So overflow detection does come with a cost, although it could be made reasonably small with good compiler support.
But in many cases overflows are either not possible by design (e.g. the valid input parameters of a function cannot produce overflow), desired (e.g. wrap-around counters), or, when they do happen, are caught by other means when the result is used (e.g. by array bounds checking).
I have to think hard for instances where I actually ever felt the need for overflow checking. Usually you're far more concerned with validating the value range at specific points (e.g. function arguments). But these are arbitrary checks for a function-specific value range, which the compiler cannot even know (well, in some languages it could, because it's explicitly expressed, but neither Java nor C falls into this category).
So overflow checking is not universally useful. That doesn't mean there aren't potential bugs it could prevent, but compared to other bug types, overflow isn't really a common issue. I can't remember when I last saw a bug caused by integer overflow. Off-by-one bugs are far more common, for example. On the other hand, there are some micro-optimizations that explicitly rely on overflow wraparound (e.g. an old question of mine, see the accepted answer: Performance: float to int cast and clipping result to range).
With the situation as described, forcing C/Java to check and respond to integer overflow would make them worse languages. They would be slower, and/or the programmer would simply deactivate the feature because it gets in the way more than it is useful. That doesn't mean overflow checking as a language feature would generally be bad; but to really get something out of it, the environment also needs to fit (e.g. as mentioned above, Java would need unsigned types).
TL;DR It could be useful, but it requires much deeper language support than just a switch to be useful.
I can offer two potential factors as to why unchecked arithmetic is the default:
Sense of familiarity: Arithmetic in C and C++ is unchecked by default and people who got used to those languages would not expect the program to throw, but to silently continue. This is a misconception, as both C and C++ have undefined behavior on signed integer overflow/underflow. But nonetheless, it has created a certain expectation in many people's minds and new languages in the same family tend to shy away from visibly breaking established conventions.
Benchmark performance: Detecting overflow/underflow usually requires the execution of more instructions than you would need if you decided to ignore it. Imagine how a new language would look if a person not familiar with it wrote a math-heavy benchmark (as often happens) and "proved" that the language is dramatically slower than C and C++ even for the simplest mathematical operations. This would damage people's perception of the language's performance, and it could hinder its adoption.
The Java language just does not have this feature built in as a keyword or mechanism to apply directly to the +, - and * operators. For example, C# has the checked and unchecked keywords for this. However, these checks can be costly and hard to implement when there is no native support in the language. As of Java 1.8, the methods addExact, subtractExact and multiplyExact have been added to the API to provide this feature, as pointed out by @Tom in the comments.
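For example, a minimal sketch using those Java 8 methods:

int a = Integer.MAX_VALUE;
int b = 1;

System.out.println(a + b);                    // silently wraps to Integer.MIN_VALUE
try {
    System.out.println(Math.addExact(a, b));  // throws instead of wrapping
} catch (ArithmeticException e) {
    System.out.println("overflow detected: " + e.getMessage());
}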
Why is this not done automatically even if the language supports it? The simple answer is that in general over- and underflow can be accepted or wanted behaviour, or they simply do not occur because of a sophisticated and well-executed design, as it should be. I would say that exploiting over- and underflows is rather a low-level or hardware programming concern, used to avoid additional operations for performance reasons.
Overall, your application design should either explicitly state the sensible use of arithmetic over- and underflows, or better yet not need them at all, because they can lead to confusion, unintuitive behaviour or critical bugs. In the first case you don't check; in the second case the check would be useless. An automatic check would be superfluous and only cost performance.
A contrived example of a wanted overflow could be a counter. Say you have an unsigned short and count it up. After 65535 it wraps back to zero because of the overflow, which can be convenient.
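Java has no unsigned short, but char is a 16-bit unsigned type, so a rough sketch of such a wrapping counter looks like this:

char counter = 65535;               // 16-bit unsigned, range 0..65535
counter++;                          // wraps around to 0, no exception
System.out.println((int) counter);  // prints 0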
Is there an algorithm which can be applied in most languages to determine whether a calculation, when performed, will overflow the type size?
As an example, given the following code fragment in Java (although again I am looking for a general approach in any language):
long variable = 2;
while (true) {
    variable = variable * Generalclass.function();
}
Assuming that Generalclass.function() returns something which causes variable to increase, variable will eventually overflow. So how can it be determined that this call has caused the overflow, if Generalclass.function()'s properties are unknown (other than that it increases the value of variable)? Note that variable is declared as a long, so simply checking against a larger data type will not work, since no such data type exists.
The most direct answer to your question is "no, there isn't an algorithm to compute if a calculation will overflow". Each type of operation (add, mult, etc.) has unique requirements for detecting overflow conditions. At the machine language level, some processors have special condition codes which can be checked if an arithmetic operation overflowed, but when coding at a higher level, you don't have access to this.
Check out the book, "Hacker's Delight" for some algorithms for detecting overflow. You might also want to look at the source code for the various Java Math "exact" methods. Some of the implementations refer to Hacker's Delight, and many of them also have "intrinsic" versions which are replaced with low-level alternative implementations which are faster.
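As a hedged sketch of what such a per-operation check looks like for the multiplication in the question (Generalclass.function() is the hypothetical call from the original fragment):

long variable = 2;
while (true) {
    long factor = Generalclass.function();
    // Manual pre-check in the spirit of Hacker's Delight, assuming both
    // operands are positive (as the question implies):
    if (factor != 0 && variable > Long.MAX_VALUE / factor) {
        throw new ArithmeticException("long overflow");
    }
    variable = variable * factor;
    // On Java 8+ the same effect: variable = Math.multiplyExact(variable, factor);
}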
Following pseudo-C++-code:
vector v;
... filling vector here and doing stuff ...
assert(is_sorted(v));
auto x = std::find(v, elementToSearchFor);
find has linear runtime, because it's called on a vector, which may be unsorted. But at that line in that specific program we know that either the program is incorrect (as in: it doesn't run to the end if the assertion fails) or the vector being searched is sorted, therefore allowing a binary-search find in O(log n). Optimizing it into a binary search should be done by a good compiler.
This is only the easiest worst-case behavior I found so far (more complex assertions may allow even more optimization).
Do some compilers do this? If yes, which ones? If not, why don't they?
Appendix: Some higher-level languages may easily do this (especially functional ones), so this is more about C/C++/Java and similar languages.
Rice's Theorem basically states that non-trivial properties of code cannot be computed in general.
The relationship between is_sorted being true and it being valid to run a faster search instead of a linear one is a non-trivial property of the program after is_sorted is asserted.
You can arrange for explicit connections between is_sorted and the ability to use various faster algorithms. The way you communicate this information in C++ to the compiler is via the type system. Maybe something like this:
template<typename C>
struct container_is_sorted {
  C c;
  // forward a bunch of methods to `c`, for example:
  auto begin() const { return c.begin(); }
  auto end() const { return c.end(); }
};
Then you'd invoke a container-based algorithm that would either use a linear search on most containers, or a sorted search on containers wrapped in container_is_sorted.
This is a bit awkward in C++. In a system where variables could carry different compiler-known type-like information at different points in the same stream of code (types that mutate under operations) this would be easier.
I.e., suppose types in C++ had a sequence of tags like int{positive, even} that you could attach to them, and you could change the tags:
int x;
make_positive(x);
Operations on a type that did not actively preserve a tag would automatically discard it.
Then assert( {is sorted}, foo ) could attach the tag {is sorted} to foo. Later code could then consume foo and have that knowledge. If you inserted something into foo, it would lose the tag.
Such tags might be run time (that has cost, however, so unlikely in C++), or compile time (in which case, the tag-state of a given variable must be statically determined at a given location in the code).
In C++, due to the awkwardness of such stuff, we instead by habit simply note it in comments and/or use the full type system to tag things (rvalue vs lvalue references are an example that was folded into the language proper).
So the programmer is expected to know it is sorted, and invoke the proper algorithm given that they know it is sorted.
Well, there are two parts to the answer.
First, let's look at assert:
7.2 Diagnostics <assert.h>
1 The header defines the assert and static_assert macros and refers to another macro, NDEBUG, which is not defined by <assert.h>. If NDEBUG is defined as a macro name at the point in the source file where <assert.h> is included, the assert macro is defined simply as
#define assert(ignore) ((void)0)
The assert macro is redefined according to the current state of NDEBUG each time that <assert.h> is included.
2 The assert macro shall be implemented as a macro, not as an actual function. If the macro definition is suppressed in order to access an actual function, the behavior is undefined.
Thus, there is nothing left in release-mode to give the compiler any hint that some condition can be assumed to hold.
Still, there is nothing stopping you from redefining assert with an implementation-defined __assume in release-mode yourself (take a look at __builtin_unreachable() in clang / gcc).
Let's assume you have done so. Now, the condition tested could be really complicated and expensive. Thus, you really want to annotate it so it does not ever result in any run-time work. Not sure how to do that.
Let's grant that your compiler even allows that, for arbitrary expressions.
The next hurdle is recognizing what the expression actually tests, and how that relates to the code as written and any potentially faster, but under the given assumption equivalent, code.
This last step results in an immense explosion of compiler-complexity, by either having to create an explicit list of all those patterns to test or building a hugely-complicated automatic analyzer.
That's no fun, and just about as complicated as building SkyNET.
Also, you really do not want to use an asymptotically faster algorithm on a data-set which is too small for asymptotic time to matter. That would be a pessimization, and you just about need precognition to avoid such.
Assertions are (usually) compiled out in the final code. Meaning, among other things, that the code could (silently) fail (by retrieving the wrong value) due to such an optimization, if the assertion was not satisfied.
If the programmer (who put the assertion there) knew that the vector was sorted, why didn't he use a different search algorithm? What's the point in having the compiler second-guess the programmer in this way?
How does the compiler know which search algorithm to substitute for which, given that they all are library routines, not a part of the language's semantics?
You said "the compiler". But compilers are not there for the purpose of writing better algorithms for you. They are there to compile what you have written.
What you might have asked is whether the library function std::find could be implemented to detect whether it can use something better than linear search. In reality it might be possible if the user has passed in std::set iterators, or even std::unordered_set iterators, and the STL implementer knows details of those iterators and can make use of them, but not in general and not for vector.
assert itself only applies in debug mode, and optimisations are normally needed for release mode. Also, a failed assert aborts the program; it doesn't switch to a different library algorithm.
Essentially, there are collections provided for faster lookup and it is up to the programmer to choose it and not the library writer to try to second guess what the programmer really wanted to do. (And in my opinion even less so for the compiler to do it).
In the narrow sense of your question, the answer is: they do if they can, but mostly they can't, because the language isn't designed for it and assert expressions are too complicated.
If assert() is implemented as a macro (as it is in C++), and it has not been disabled (by setting NDEBUG in C++) and the expression can be evaluated at compile time (or can be data traced) then the compiler will apply its usual optimisations. That doesn't happen often.
In most cases (and certainly in the example you gave) the relationship between the assert() and the desired optimisation is far beyond what a compiler can do without assistance from the language. Given the very low level of meta-programming capability in C++ (and Java) the ability to do this is quite limited.
In the wider sense I think what you're really asking for is a language in which the programmer can make assertions about the intention of the code, from which the compiler can choose between different translations (and algorithms). There have been experimental languages attempting to do that, and Eiffel had some features in that direction, but I'm not aware of any mainstream compiled languages that could do it.
Optimizing it into a binary search should be done by a good compiler.
No! A linear search results in a much more predictable branch. If the array is short enough, linear search is the right thing to do.
Apart from that, even if the compiler wanted to, the list of ideas and notions it would have to know about would be immense and it would have to do nontrivial logic on them. This would get very slow. Compilers are engineered to run fast and spit out decent code.
You might spend some time playing with formal verification tools whose job is to figure out everything they can about the code they're fed in, which asserts can trip, and so forth. They're often built without the same speed requirements compilers have and consequently they're much better at figuring things out about programs. You'll probably find that reasoning rigorously about code is rather harder than it looks at first sight.
I have learnt (at least in Java) that integer/long values overflow silently, and on overflow their values start over from the minimum value rather than throwing any exception.
I was using an external API for some file operations, in which the max file size was loaded from a property file. All was fine in my local testing environment. As soon as the code went to the live environment, the max file size limit was not working at all. After two days of debugging/analyzing the code, there was no success at all. Then, for some other reason, I took the live constants.properties file and debugged the code with that. o_0
I just want to ask, what prevented them to throw exception on overflow?
Java is in many ways based on C and C++, and these are based on assembly. An overflow/underflow is silent in C and C++ and almost silent in assembly (unless you check special flags). This is likely due to the fact that C and C++ didn't have exceptions when they were first proposed. If you wanted to see overflows/underflows you just used a larger type, e.g. long long int or long double ;) BTW assembly has something similar to exceptions, called traps or interrupts, but overflow/underflow doesn't cause a trap AFAIK.
What I prefer to do is use long and double unless I am sure these types are much larger than needed. You can't have a device whose size overflows a long.
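If the values involved are ints, a hedged sketch of that "use a larger type" idea is to do the arithmetic in long and check whether the result still fits:

int a = 2_000_000_000;
int b = 2_000_000_000;

long wide = (long) a + b;       // a long addition of two ints cannot overflow
if (wide != (int) wide) {
    throw new ArithmeticException("int overflow");  // result does not fit in an int
}
int sum = (int) wide;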
The reason is "because the Java Language Specification says so".
Section 4.2.2. Integer Operations of the JLS says:
The integer operators do not indicate overflow or underflow in any way.
To me this makes sense, otherwise you’d need either:
a 'NumericOverflowException' to be thrown, which would require a 'try catch', or
a flag to be set on the primitive result, which would require a more complex handling of primitive operations
Both of which would make primitives and their operations “not simple”, and simplicity with primitives is a strength not worth sacrificing for a predictable and typically rare occurrence.
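The defined wrap-around behaviour the JLS describes is easy to observe:

int i = Integer.MAX_VALUE;                   //  2147483647
i = i + 1;                                   // wraps, no flag, no exception
System.out.println(i == Integer.MIN_VALUE);  // true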
I am using the Long/Integer data types very frequently in my application to build generic data types. I fear that using these wrapper objects instead of primitive data types may be harmful for performance, since each time an object has to be created, which is an expensive operation. But it also seems that I have no other choice (when I have to use primitives with generics) than to use them.
However, it would still be great if you could suggest whether there is anything I could do to make this better, or any way I could avoid it altogether?
Also What may be the downsides of this ?
Suggestions welcomed!
Repeat after me. "Creating an object in Java is not an expensive operation".
You are prematurely optimizing your application. A better approach is to implement it in the natural way using Integer and Long, then profile it to determine where the bottlenecks are. If the profiler tells you that use of Integer and Long is a performance issue, then look at ways to cure this.
If you determine that Integer and Long really are an issue, here are some things you could do:
Look for a class library that implements "collections" of primitive types; e.g. Trove. But beware that the APIs of such collection types won't be compatible with java.util.Collection and its descendants.
Use Integer.valueOf(int) and Long.valueOf(long) rather than new Integer(int) and new Long(long). The valueOf methods use a cache of frequently used objects to reduce the number of object creations.
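For instance (the identity results below are only guaranteed for the cached range -128 to 127):

Integer a = Integer.valueOf(100);
Integer b = Integer.valueOf(100);
System.out.println(a == b);        // true: both refer to the same cached object

Integer c = new Integer(100);      // deprecated since Java 9
Integer d = new Integer(100);
System.out.println(c == d);        // false: two distinct objects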
@Rex Kerr's comment is that this is horrible advice. He is (I think) saying that the OP should optimize his application to reduce the use of Integer and Long before he knows that this will be a performance concern. I disagree.
At this point (when he asked the question), the OP didn't know that his application needed optimization. If the application runs "fast enough" without any optimization, then any developer time spent optimizing it would be better spent on something else.
At this point, the OP doesn't know where the performance bottlenecks are. If they are not in the handling of these values, then optimizing this aspect will be a waste of time. Note that generally speaking it is a bad idea to rely solely on your intuition to tell you where the bottlenecks are or are likely to be.
@Rex Kerr posits that it would be a lot of work to modify/restructure the code to fix performance issues due to over-use of Integer and Long. That's simply not true. A decent IDE makes it easy to make this sort of change in a small to medium-sized application.
If you have many collections, or large collections, you are likely to have performance problems. See http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf.
If you have many collections, or large collections, or many large collections of boxed types (e.g. Integer, Long) there are alternatives: one is the Mahout Collections library, from http://mahout.apache.org. Mahout collections have open hash tables, which address many of the issues in the linked PDF, and collections that store little-i-integers, etc. Another is Trove, if GPL doesn't bother you.
If you are not sure that your code qualifies as 'many,' 'large', or 'many large', then by all means use a profiler and see what's going on.
Like others say, "Premature optimization is the root of all evil."
Having said that, prefer primitive types to boxed types wherever you can.
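The classic illustration of why, as a rough sketch: a boxed accumulator re-boxes on every addition, while the primitive version does not.

Long boxedSum = 0L;                 // boxed accumulator
for (long i = 0; i < 1_000_000; i++) {
    boxedSum += i;                  // unbox, add, re-box: a new Long for most values
}

long primitiveSum = 0L;             // primitive accumulator
for (long i = 0; i < 1_000_000; i++) {
    primitiveSum += i;              // plain arithmetic, no allocation
}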
UPDATE: I might also add that, according to developers who work with high-performance code (like distributed caches), boxing can indeed become a performance problem quite frequently. I have also worked with high-performance apps, but have never identified boxing as a worthwhile optimization target yet.
You are better off profiling your application and looking at where your bottlenecks and hot spots are. These are very hard to predict most of the time. IMHO If you are not measuring, you are just guessing.
However, if you determine that using primitive in a collection would be more efficient, I suggest you try http://trove.starlight-systems.com/ It can make a big difference when it really matters but for 90% of the time, it doesn't.
Here's an excerpt from Sun's Java tutorials:
A switch works with the byte, short, char, and int primitive data types. It also works with enumerated types (discussed in Classes and Inheritance) and a few special classes that "wrap" certain primitive types: Character, Byte, Short, and Integer (discussed in Simple Data Objects).
There must be a good reason why the long primitive data type is not allowed. Anyone know what it is?
I think to some extent it was probably an arbitrary decision based on typical use of switch.
A switch can essentially be implemented in two ways (or in principle, a combination): for a small number of cases, or ones whose values are widely dispersed, a switch essentially becomes the equivalent of a series of ifs on a temporary variable (the value being switched on must only be evaluated once). For a moderate number of cases that are more or less consecutive in value, a switch table is used (the TABLESWITCH instruction in Java), whereby the location to jump to is effectively looked up in a table.
Either of these methods could in principle use a long value rather than an integer. But I think it was probably just a practical decision to balance up the complexity of the instruction set and compiler with actual need: the cases where you really need to switch over a long are rare enough that it's acceptable to have to re-write as a series of IF statements, or work round in some other way (if the long values in question are close together, you can in your Java code switch over the int result of subtracting the lowest value).
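A hedged sketch of that last workaround (BASE and the handle* methods are hypothetical), assuming the long case values are close together:

static void dispatch(long value) {
    final long BASE = 10_000_000_000L;    // smallest case value (hypothetical)
    switch ((int) (value - BASE)) {       // offsets are small enough to fit in an int
        case 0:  handleFirst();  break;
        case 1:  handleSecond(); break;
        default: handleOther();  break;
    }
}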
Because they didn't implement the necessary instructions in the bytecode and you really don't want to write that many cases, no matter how "production ready" your code is...
[EDIT: Extracted from comments on this answer, with some additions on background]
To be exact, 2³² is a lot of cases and any program with a method long enough to hold more than that is going to be utterly horrendous! In any language. (The longest function I know of in any code in any language is a little over 6k SLOC – yes, it's a big switch – and it's really unmanageable.) If you're really stuck with having a long where you should have only an int or less, then you've got two real alternatives.
Use some variant on the theme of hash functions to compress the long into an int. The simplest one, only for use when you've got the type wrong, is to just cast! More useful would be to do this:
(int) ((x & 0xFFFFFFFFL) ^ ((x >>> 32) & 0xFFFFFFFFL))
before switching on the result. You'll have to work out how to transform the cases that you're testing against too. But really, that's still horrible since it doesn't address the real problem of lots of cases.
A much better solution if you're working with very large numbers of cases is to change your design to using a Map<Long,Runnable> or something similar so that you're looking up how to dispatch a particular value. This allows you to separate the cases into multiple files, which is much easier to manage when the case-count gets large, though it does get more complex to organize the registration of the host of implementation classes involved (annotations might help by allowing you to build the registration code automatically).
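A minimal sketch of that dispatch-map idea (using Java 8 lambdas for brevity; the handler bodies are placeholders):

// assumes java.util.Map and java.util.HashMap are imported
Map<Long, Runnable> handlers = new HashMap<>();
handlers.put(1L, () -> System.out.println("handling case 1"));
handlers.put(9_999_999_999L, () -> System.out.println("handling case 9999999999"));

long key = 9_999_999_999L;
handlers.getOrDefault(key, () -> System.out.println("default case")).run();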
FWIW, I did this many years ago (we switched to the newly-released J2SE 1.2 part way through the project) when building a custom bytecode engine for simulating massively parallel hardware (no, reusing the JVM would not have been suitable due to the radically different value and execution models involved) and it enormously simplified the code relative to the big switch that the C version of the code was using.
To reiterate the take-home message, wanting to switch on a long is an indication that either you've got the types wrong in your program or that you're building a system with that much variation involved that you should be using classes. Time for a rethink in either case.
Because the lookup table index must be 32 bits.
It just so happens that I have come across this 12-year-old question, and I can offer one of the best solutions to this problem: use the latest JDK, because long and Long are now supported in switch-case statements. :)
A long, on 32-bit architectures, is represented by two words. Now, imagine what could happen if, due to insufficient synchronization, the execution of the switch statement observed a long with its high 32 bits from one write and its low 32 bits from another! It could try to go to... who knows where! Basically somewhere at random. Even if both writes represented valid cases for the switch statement, their funny combination would probably lead to neither the first nor the second -- or, far worse, it could lead to another valid but unrelated case!
At least with an int (or lesser types), no matter how badly you mess up, the switch statement will at least read a value that someone actually wrote, instead of a value "out of thin air".
Of course, I don't know the actual reason (it's been more than 15 years, I haven't been paying attention that long!), but if you realize how unsafe and unpredictable such a construct could be, you'll agree that this is a definitely very good reason not to ever have a switch on longs (and as long -pun intended- there will be 32bit machines, this reason will remain valid).