I've written a little predator-prey simulation in Java. Even though the rules are quite complicated and give rise to a chaotic system, the techniques used are simple:
arithmetic and decisions on basic data types
no external libraries
no external systems are included
no concurrency occurs
no use of current time or date
So I thought that initializing the system with identical parameters should produce identical results, but it doesn't, and I wonder why.
Some thoughts on that:
My application uses Randoms, but for this test I initialize them all with a fixed seed, so in my understanding they should produce the same outputs in the same order on every run.
I'm iterating through Sets, and I know that the order in which a Set is iterated isn't defined. But I don't see any reason why a Set that is filled in the same order with the same values should behave differently across runs. Does it?
I'm using a lot of floats. Data types where 1 + 1 = 1.9999999999725 are always suspect to me, but even if their behavior is strange, it should always be the same kind of strange. Shouldn't it?
Garbage collection isn't deterministic, but as long as I don't rely on finalizers I should be safe.
As said above, there is no concurrency, and no data types that depend on the current time are in use.
I can't reproduce that behavior in a simple example. But going through my code, I can't see anything that could be unpredictable. So are any of my assumptions above wrong? Any ideas what I could be missing?
Here's a test to verify my assumptions:
public static void main(String[] args) {
    Random r = new Random(1);
    Set<Float> s = new HashSet<Float>();
    for (int i = 0; i < 1000000; i++) {
        s.add(r.nextFloat());
    }
    float ret = 1;
    int cnt = 0;
    for (Float f : s) {
        float multiply = 0.3f;
        if (cnt++ % 2 == 0) {
            multiply = 0.7f;
        }
        float f2 = (f * multiply);
        ret += f2;
    }
    System.out.println(ret);
}
It always results in 242455.25 for me.
You can write a deterministic program in Java. You just need to eliminate the possible sources of non-determinism.
It's hard to know what could be causing non-determinism without seeing your actual code, and concrete evidence of that non-determinism.
There are any number of library methods that could potentially be sources of non-deterministic behaviour ... depending on how you use them.
For example, the value returned by Object.hashCode() (the first time it is called on an instance) is non-deterministic, and that percolates through to any library that uses hashing. It can definitely affect the order in which the elements of a HashSet or HashMap are returned when you iterate them ... if the element class doesn't override hashCode().
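To illustrate the point, here is a small sketch of my own (the Key class is hypothetical): because Key does not override hashCode(), each instance gets an identity hash code, which typically differs from run to run, and with it the iteration order of the HashSet.

import java.util.HashSet;
import java.util.Set;

public class IterationOrder {
    static class Key {
        final int id;
        Key(int id) { this.id = id; }
        // equals()/hashCode() deliberately not overridden: identity hashing applies
    }

    public static void main(String[] args) {
        Set<Key> keys = new HashSet<>();
        for (int i = 0; i < 10; i++) {
            keys.add(new Key(i));
        }
        for (Key k : keys) {
            System.out.print(k.id + " ");   // order may differ between runs
        }
        System.out.println();
    }
}

Note that the Float elements in the test program above use Float's value-based hashCode(), which is why that particular test stays deterministic.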
Random number generators may or may not be deterministic. If they are pseudo-random and they are initialized with fixed seeds, then the sequence of numbers produced by each one will be deterministic.
Floating point arithmetic should be deterministic. For any (fixed) set of inputs to an arithmetic expression, the result should always be the same. (I'm not sure that determinism of floating point arithmetic is guaranteed by the JLS, but non-determinism there would be mighty strange in practice. As in ... you are running on broken hardware.)
FOLLOWUP ... on strictfp and non-determinism.
According to the JLS 15.4:
"Within an expression that is not FP-strict, some leeway is granted for an implementation to use an extended exponent range to represent intermediate results; the net effect, roughly speaking, is that a calculation might produce "the correct answer" in situations where exclusive use of the float value set or double value set might result in overflow or underflow."
This doesn't exactly say how much "leeway" the implementation has in non-FP-strict expressions. However, I'd have thought that that leeway would not extend to allowing non-deterministic behaviour. I'd have thought that a JIT compiler on a particular platform would always generate equivalent native code for the same expression, and that code would be deterministic. (I can't see any reason for non-determinism ... unless the hardware itself has non-deterministic floating point.) The other possible source of non-determinism might be that the behaviour of JIT-compiled and interpreted code might differ. But frankly, I think it would be "nuts" to allow that to happen ... and I think we'd have heard of it.
So while non-FP-strict expression evaluation could be non-deterministic in theory, I think we should discount this ... unless there is clear evidence that it happens in practice.
(Note that I'm talking about real non-determinism, not platform differences.)
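If you want to rule this source out explicitly, a minimal sketch (the Simulation class name is mine, not from the question) is to mark the relevant code strictfp, which forces FP-strict evaluation and removes the extended-exponent leeway discussed above. On Java 17 and later this is redundant, since JEP 306 made all floating-point arithmetic strict again.

public strictfp class Simulation {
    double step(double prey, double predators) {
        // intermediate results are confined to the ordinary double value set
        return prey * 0.7 + predators * 0.3;
    }
}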
I'm iterating through Sets, and I know that the order in which a Set is iterated isn't defined. But I don't see any reason why a Set that is filled in the same order with the same values should behave differently across runs. Does it?
It can. The implementation is free to derive the default hash code from, for example, the object's location in memory, which can vary depending on when garbage collection runs.
Related
int y=3;
int z=(--y) + (y=10);
when executed in C, the value of z evaluates to 20,
but when the same expression is executed in Java, z evaluates to 12.
Can anyone explain why this is happening and what the difference is?
when executed in C, the value of z evaluates to 20
No, it does not. This is undefined behavior, so z could get any value, including 20. The program could also theoretically do anything, since the standard does not say what the program should do when it encounters undefined behavior. Read more here: Undefined, unspecified and implementation-defined behavior
As a rule of thumb, never modify a variable twice in the same expression.
It's not a good duplicate, but this will explain things a bit deeper. The reason for undefined behavior here is sequence points. Why are these constructs using pre and post-increment undefined behavior?
In C, when it comes to arithmetic operators like + and /, the order of evaluation of the operands is not specified in the standard, so if evaluating them has side effects, your program becomes unpredictable. Here is an example:
#include <stdio.h>

int foo(void)
{
    printf("foo()\n");
    return 0;
}

int bar(void)
{
    printf("bar()\n");
    return 0;
}

int main(void)
{
    int x = foo() + bar();
}
What will this program print? Well, we don't know. I'm not entirely sure whether this snippet invokes undefined behavior or not, but regardless, the output is not predictable. I asked a question about this, Is it undefined behavior to use functions with side effects in an unspecified order?, so I'll update this answer later.
Some other operators, like || and &&, have a specified (left to right) order of evaluation, and this feature is used for short-circuiting. For instance, if we use the example functions above and write foo() && bar(), only the foo() function will be executed.
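The same short-circuiting exists in Java, where the operands are also evaluated left to right; here is a small sketch of my own (method names borrowed from the C example above):

public class ShortCircuit {
    static boolean foo() { System.out.println("foo()"); return false; }
    static boolean bar() { System.out.println("bar()"); return true; }

    public static void main(String[] args) {
        boolean result = foo() && bar();   // prints only "foo()"; bar() is never called
        System.out.println(result);        // false
    }
}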
I'm not very proficient in Java, but for completeness, I want to mention that Java basically does not have undefined or unspecified behavior except for very special situations. Almost everything in Java is well defined. For more details, read rzwitserloot's answer
There are 3 parts to this answer:
How this works in C (unspecified behaviour)
How this works in Java (the spec is clear on how this should be evaluated)
Why there is a difference.
For #1, you should read klutt's fantastic answer.
For #2 and #3, you should read this answer.
How does it work in java?
Unlike C's, java's language specification is far more precisely specified. For example, C doesn't even tell you how many bits the data type int is supposed to have, whereas the java lang spec does: 32 bits, even on 64-bit processors and a 64-bit java implementation.
The java spec clearly says that x+y is to be evaluated left-to-right (vs. C's 'in any order you please, compiler'), thus, first --y is evaluated which is clearly 2 (with the side-effect of making y 2), and then y=10 is evaluated which is clearly 10 (with the side effect of making y 10), and then 2+10 is evaluated which is clearly 12.
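To see this concretely, here is the expression from the question as a runnable snippet (the class name is mine); on any conforming JVM it prints 12:

public class EvaluationOrder {
    public static void main(String[] args) {
        int y = 3;
        int z = (--y) + (y = 10);   // left operand first: (--y) yields 2, then (y = 10) yields 10
        System.out.println(z);      // 12
    }
}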
Obviously, a language like java is just better; after all, undefined behaviour is pretty much a bug by definition. What was wrong with the C lang spec writers, to introduce this crazy stuff?
The answer is: performance.
In C, your source code is turned into machine code by the compiler, and the machine code is then interpreted by the CPU. A 2-step model.
In java, your source code is turned into bytecode by the compiler, the bytecode is then turned into machine code by the runtime, and the machine code is then interpreted by the CPU. A 3-step model.
If you want to introduce optimizations, you don't control what the CPU does, so for C there is only 1 step where it can be done: Compilation.
So C (the language) is designed to give lots of freedom to C compilers to attempt to produce optimized machine code. This is a cost/benefit scenario: At the cost of having a ton of 'undefined behaviour' in the lang spec, you get the benefit of better optimizing compilers.
In java, you get a second step, and that's where java does its optimizations: at runtime. java.exe does it to class files; javac.exe is quite 'stupid' and optimizes almost nothing. This is on purpose; at runtime you can do a better job (for example, you can use some bookkeeping to track which of two branches is more commonly taken, and thus branch-predict better than a C app ever could). It also means the cost/benefit analysis now comes out differently: the lang spec should be clear as day.
So java code never has undefined behaviour?
Not so. Java has a memory model which includes a ton of undefined behaviour:
class X { int a, b; }

X instance = new X();

new Thread() { public void run() {
    int a = instance.a;
    int b = instance.b;
    instance.a = 5;
    instance.b = 6;
    System.out.print(a);
    System.out.print(b);
}}.start();

new Thread() { public void run() {
    int a = instance.a;
    int b = instance.b;
    instance.a = 1;
    instance.b = 2;
    System.out.print(a);
    System.out.print(b);
}}.start();
is undefined in java. It may print 0056, 0012, 0010, 0002, 5600, 0600, and many many more possibilities. Something like 5000 (which it could legally print) is hard to imagine: How can the read of a 'work' but the read of b then fail?
For the exact same reason your C code produces arbitrary answers:
Optimization.
The cost/benefit of 'hardcoding' in the spec exactly how this code should behave would carry a large cost: you'd take away most of the room for optimization. So java paid the cost, and now has a lang spec that is ambiguous whenever you modify/read the same fields from different threads without establishing so-called 'happens-before' relationships, using e.g. synchronized.
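For illustration, here is one way (my own sketch, not part of the original answer) to establish those happens-before edges: guard both fields with the same lock, so a reader always observes a pair of values that was written together.

class X {
    private int a, b;

    synchronized void set(int newA, int newB) {   // writers and readers use the same lock
        a = newA;
        b = newB;
    }

    synchronized int[] get() {
        return new int[] { a, b };
    }
}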
When executed in C, the value of z evaluates to 20
That is not quite true. The compiler you use happens to evaluate it to 20. Another one can evaluate it in a completely different way: https://godbolt.org/z/GcPsKh
This kind of behaviour is called Undefined Behaviour.
In your expression you have two problems.
The order of evaluation (except for the logical operators) is not specified in C (it is Unspecified Behaviour)
In this expression there is also a problem with sequence points (Undefined Behaviour)
In section 12.3.3., "Unrealistic Sampling of Code Paths" the Java Concurrency In Practice book says:
In some cases, the JVM may make optimizations based on assumptions that may only be true temporarily, and later back them out by invalidating the compiled code if they become untrue
I cannot understand the above statement.
What are these JVM assumptions?
How does the JVM know whether the assumptions are true or untrue?
If the assumptions are untrue, does it influence the correctness of my data?
The statement that you quoted has a footnote which gives an example:
For example, the JVM can use monomorphic call transformation to convert a virtual method call to a direct method call if no classes currently loaded override that method, but it invalidates the compiled code if a class is subsequently loaded that overrides the method.
The details are very, very, very complex here, so the following is an extremely oversimplified example.
Imagine you have an interface:
interface Adder { int add(int x); }
The method is supposed to add a value to x and return the result. Now imagine that there is a program that uses an implementation of this interface:
class OneAdder implements Adder {
    public int add(int x) {
        return x + 1;
    }
}

class Example {
    void run() {
        OneAdder a1 = new OneAdder();
        int result = compute(a1);
        System.out.println(result);
    }

    private int compute(Adder a) {
        int sum = 0;
        for (int i = 0; i < 100; i++) {
            sum = a.add(sum);
        }
        return sum;
    }
}
In this example, the JVM could do certain optimizations. A very low-level one is that it could avoid using a vtable for calling the add method, because there is only one implementation of this method in the given program. But it could even go further, and inline this only method, so that the compute method essentially becomes this:
private int compute(Adder a) {
    int sum = 0;
    for (int i = 0; i < 100; i++) {
        sum += 1;
    }
    return sum;
}
and in principle, even this:
private int compute(Adder a) {
    return 100;
}
But the JVM can also load classes at runtime. So there may be a case where this optimization has already been done, and later, the JVM loads a class like this:
class TwoAdder implements Adder {
    public int add(int x) {
        return x + 2;
    }
}
Now, the optimization that has been done to the compute method may become "invalid", because it's not clear whether it is called with a OneAdder or a TwoAdder. In this case, the optimization has to be undone.
This should answer question 1.
Regarding 2.: The JVM keeps track of all the optimizations that have been done, of course. It knows that it has inlined the add method based on the assumption that there is only one implementation of this method. When it finds another implementation of this method, it has to undo the optimization.
Regarding 3.: The optimizations are done when the assumptions are true. When they become untrue, the optimization is undone. So this does not affect the correctness of your program.
Update:
Again, the example above was very simplified, referring to the footnote that was given in the book. For further information about the optimization techniques of the JVM, you may refer to https://wiki.openjdk.java.net/display/HotSpot/PerformanceTechniques. Specifically, the speculative (profile-based) techniques can probably be considered to be mostly based on "assumptions" - namely, on assumptions made from the profiling data that has been collected so far.
Taking the quoted text in context, this section of the book is actually talking about the importance of using realistic test data (inputs) when you do performance testing.
Your questions:
What are these JVM assumptions?
I think the text is talking about two things:
On the one hand, it seems to be talking about optimizing based on the measurement of code paths. For example whether the "then" or "else" branch of an if statement is more likely to be executed. This can indeed result in generation of different code and is susceptible to producing sub-optimal code if the initial measurements are incorrect.
On the other hand, it also seems to be talking about optimizations that may turn out to be invalid. For example, at a certain point in time, there may be only one implementation of a given interface method that has been loaded by the JVM. On seeing this, the optimizer may decide to simplify the calling sequence to avoid polymorphic method dispatching. (The term used in the book for this is "monomorphic call transformation".) A bit later, a second implementation may be loaded, causing the optimizer to back out that optimization.
The first of these cases only affects performance.
The second of these would affect correctness (as well as performance) if the optimizer didn't back out the optimization. But the optimizer does do that. So it only affects performance. (The methods containing the affected calls need to be re-optimized, and that affects overall performance.)
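To make the first case concrete, here is a hypothetical sketch (all names invented): if the test inputs are all short strings, the JIT's branch profile will treat the fast path as the overwhelmingly common one, and production inputs that take the other branch will behave quite differently from what the benchmark measured.

class PathSampling {
    static int parse(String s) {
        if (s.length() < 16) {                          // always true for unrealistically short test data
            return Integer.parseInt(s);
        }
        return new java.math.BigInteger(s).intValue();  // slow path the test never exercises
    }
}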
How does the JVM know whether the assumptions are true or untrue?
In the first case, it doesn't.
In the second case, the problem is noticed when the JVM loads the 2nd method, and sees a flag on (say) the interface method that says that the optimizer has assumed that it is effectively a final method. On seeing this, the loader triggers the "back out" before any damage is done.
If the assumptions are untrue, does it influence the correctness of my data?
No it doesn't. Not in either case.
But the takeaway from the section is that the nature of your test data can influence performance measurements. And it is not simply a matter of size. The test data also needs to cause the application to behave the same way (take similar code paths) as it would behave in "real life".
In an interface I have the following:
public static byte[] and0xFFArray(byte[] array) {
    for (int i = 0; i < array.length; i++) {
        array[i] = (byte) (array[i] & 0xFF);
    }
    return array;
}
In another class I am calling the following:
while (true) {
    ...
    if (isBeforeTerminator(htmlInput, ParserI.and0xFFArray("포토".getBytes("UTF-8")), '<')) {
        ...
    }
    ...
}
My question is, will the resultant array from the String constant be computed once during compilation, or will it be computed every time the loop iterates?
Edit: I just noticed that the method doesn't make sense, but it doesn't affect the question.
I assume that you're referring to the result of
ParserI.and0xFFArray("포토".getBytes("UTF-8"))
Unless you explicitly cache/store the results somewhere, it'll be computed every time you call it.
You may want to consider something like:
byte[] parserI = ParserI.and0xFFArray("포토".getBytes("UTF-8"));

while (true) {
    ...
    if (isBeforeTerminator(htmlInput, parserI, '<'))
        ...
To understand why compilers don't do this automatically, keep in mind that you can't write a general algorithm to detect whether a particular method will always return the same value; you'd quickly run into things like the Halting Problem, so anything you tried to write to do that would be massively complicated and still wouldn't work a good percentage of the time. You'd also have to understand a fair amount about when a method will be called in order to work out a reasonable caching strategy. For example, is it worth keeping the cache after the loop finishes? You'd have to understand a fair amount about the program's structure to know for sure.
It is possible that an optimizer could recognize that the result of a method is constant under certain limited circumstances (and I'm not sure to what extent Java optimizers actually implement that), but you certainly can't count on it in the general case. The only way to know for sure whether this is one of those cases is to look at the actual bytecode the compiler produces, but I highly doubt it's being as smart as you'd like here, for the reasons listed above. It's better to do the caching yourself explicitly, as shown above.
Let's say I have the following code:
private Rule getRuleFromResult(Fact result) {
    Rule output = null;
    for (int i = 0; i < rules.size(); i++) {
        if (rules.get(i).getRuleSize() == 1) {
            output = rules.get(i);
            return output;
        }
        if (rules.get(i).getResultFact().getFactName().equals(result.getFactName())) {
            output = rules.get(i);
        }
    }
    return output;
}
Is it better to leave it as it is or to change it as follows:
private Rule getRuleFromResult(Fact result) {
    Rule output = null;
    Rule current = null;
    for (int i = 0; i < rules.size(); i++) {
        current = rules.get(i);
        if (current.getRuleSize() == 1) {
            return current;
        }
        if (current.getResultFact().getFactName().equals(result.getFactName())) {
            output = rules.get(i);
        }
    }
    return output;
}
When executing, the program goes through rules.get(i) each time as if it were the first call, and I think that in a much more advanced example (say, as in the second if) this takes more time and slows execution. Am I right?
Edit: To answer a few comments at once: I know that in this particular example the time gain will be tiny, but it was just to get the general idea across. I've noticed I tend to have very long chains like object.get.set.change.compareTo... etc., and many of them repeat. Across the whole codebase, that time gain could be significant.
Your instinct is correct--saving intermediate results in a variable rather than re-invoking a method multiple times is faster. Often the performance difference will be too small to measure, but there's an even better reason to do this--clarity. By saving the value into a variable, you make it clear that you are intending to use the same value everywhere; if you re-invoke the method multiple times, it's unclear if you are doing so because you are expecting it to return different results on different invocations. (For instance, list.size() will return a different result if you've added items to list in between calls.) Additionally, using an intermediate variable gives you an opportunity to name the value, which can make the intention of the code clearer.
The only difference between the two versions is that in the first you may call rules.get(i) more than once for the same index.
So the second version is a little bit faster in general, but you will not feel any difference unless the list is big.
It depends on the type of data structure that the "rules" object is. If it is a linked list, then yes, the second one is much faster, because it does not have to walk the list again on each rules.get(i) call. If it is a data structure with constant-time access by index (like an array), then it is the same.
In general, yes, it's probably a tiny bit faster (nanoseconds, I'd guess), at least when called the first time. Later on it will probably be optimized by the JIT compiler either way.
But what you are doing is so-called premature optimization. You usually should not think about things that only provide an insignificant performance improvement.
What is more important is the readability to maintain the code later on.
You could do even more premature optimization, like saving the length in a local variable, which the for-each loop does internally. But again, in 99% of cases it doesn't make sense to do it.
Is there ever a justifiable reason in Java to do something like
Long l = new Long(SOME_CONSTANT)
This creates an extra object and is tagged by FindBugs, and is obviously a bad practice. My question is whether there is ever a good reason to do so?
I previously asked this about String constructors and got a good answer, but that answer doesn't seem to apply to numbers.
Only if you want to make sure you get a unique instance, so practically never.
Some numbers can be cached when autoboxed (although Longs aren't guaranteed to be), which might cause problems. But any code that would break because of caching probably has deeper issues. Right now, I can't think of a single valid case for it.
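To illustrate both points, here is a small sketch; the results of the == comparisons reflect a typical OpenJDK build, since the Long cache range is an implementation detail rather than a spec guarantee:

public class LongCacheDemo {
    public static void main(String[] args) {
        Long a = Long.valueOf(127L);
        Long b = Long.valueOf(127L);
        System.out.println(a == b);        // true here: 127 falls inside the cache

        Long c = Long.valueOf(128L);
        Long d = Long.valueOf(128L);
        System.out.println(c == d);        // false: outside the cache, distinct instances

        Long e = new Long(127L);           // deprecated constructor: always a fresh instance
        System.out.println(e == a);        // false
        System.out.println(e.equals(a));   // true: equals() and hashCode() are unaffected
    }
}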
My question is whether there is ever a good reason to do so?
You might still use it if you want to write code compatible with older JREs. valueOf(long) was only introduced in Java 1.5, so in Java 1.4 and before the constructor was the only way to go directly from a long to a Long. I expect it isn't deprecated because the constructor is still used internally.
The only thing I can think of is to make the boxing explicit, although the equivalent autoboxed code is actually compiled into Long.valueOf(SOME_CONSTANT), which can cache small values (from the JDK source):
public static Long valueOf(long l) {
    final int offset = 128;
    if (l >= -128 && l <= 127) { // will cache
        return LongCache.cache[(int)l + offset];
    }
    return new Long(l);
}
Not a big deal, but I dislike seeing code that continually boxes and unboxes without regard for type, which can get sloppy.
Functionally, though, I can't see a difference one way or the other. The new Long will still be equals() to the autoboxed one and have an equal hashCode(), so I can't see how you could even make a functional distinction if you wanted to.