Empty speed test, unexpected result - java

Why might this code
long s, e, sum1 = 0, sum2 = 0, TRIALS = 10000000;
for (long i = 0; i < TRIALS; i++) {
    s = System.nanoTime();
    e = System.nanoTime();
    sum1 += e - s;
    s = System.nanoTime();
    e = System.nanoTime();
    sum2 += e - s;
}
System.out.println(sum1 / TRIALS);
System.out.println(sum2 / TRIALS);
produce this result on my machine?
-60
61
EDIT:
Sam I am's answer points to the nanoTime() documentation, which helps, but now, more precisely: why does the result consistently favor the first sum?
"my machine":
JavaSE-1.7, Eclipse
Win 7 x64, AMD Athlon II X4 635
switching the order inside the loop produces reverse results
for (int i = 0; i < TRIALS; i++) {
    s = System.nanoTime();
    e = System.nanoTime();
    sum2 += e - s;
    s = System.nanoTime();
    e = System.nanoTime();
    sum1 += e - s;
}
61
-61
Looking at (e - s) before adding it to sum1 makes sum1 positive.
for (long i = 0; i < TRIALS; i++) {
    s = System.nanoTime();
    e = System.nanoTime();
    temp = e - s;
    if (temp < 0)
        count++;
    sum1 += temp;
    s = System.nanoTime();
    e = System.nanoTime();
    sum2 += e - s;
}
61
61
And as Andrew Alcock points out, sum1 += -s + e produces the expected outcome.
for (long i = 0; i < TRIALS; i++) {
    s = System.nanoTime();
    e = System.nanoTime();
    sum1 += -s + e;
    s = System.nanoTime();
    e = System.nanoTime();
    sum2 += -s + e;
}
61
61
A few other tests: http://pastebin.com/QJ93NZxP

This answer is supposition. If you update your question with some details about your environment, it's likely that someone else can give a more detailed, grounded answer.
The nanoTime() function works by accessing some high-resolution timer with low access latency. On the x86, I believe this is the Time Stamp Counter, which is driven by the basic clock cycle of the machine.
If you're seeing consistent results of +/- 60 ns, then I believe you're simply seeing the basic interval of the timer on your machine.
However, what about the negative numbers? Again, supposition, but if you read the Wikipedia article on the Time Stamp Counter, you'll see a comment that Intel processors might re-order the instructions.

In conjunction with roundar, we ran a number of tests on this code. In summary, the effect disappeared when:
Running the same code in interpreted mode (-Xint)
Changing the aggregation logic order from sum += e - s to sum += -s + e
Running on some different architectures or different VMs (eg I ran on Java 6 on Mac)
Placing logging statements inspecting s and e
Performing additional arithmetic on s and e
In addition, the effect is not a threading issue:
There are no additional threads spawned
Only local variables are involved
This effect is 100% reproducible in roundar's environment, and always results in precisely the same timings, namely +61 and -61.
The effect is not a timing issue because:
The execution takes place over 10m iterations
This effect is 100% reproducible in roundar's environment
The result is precisely the same timings, namely +61 and -61, on all iterations.
Given the above, I believe we have a bug in the HotSpot JIT compiler of the Java VM. The code as written should return positive results, but does not.

Straight from Oracle's documentation.
In short: no guarantee is made about how frequently the timer value updates, which can cause results like this to differ.
nanoTime
public static long nanoTime()
Returns the current value of the most precise available system timer, in nanoseconds.
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
The value returned represents nanoseconds since some fixed but arbitrary time
(perhaps in the future, so values may be negative). This method
provides nanosecond precision, but not necessarily nanosecond
accuracy. No guarantees are made about how frequently values change.
Differences in successive calls that span greater than approximately
292 years (2^63 nanoseconds) will not accurately compute elapsed time
due to numerical overflow.
For example, to measure how long some code takes to execute:
long startTime = System.nanoTime();
// ... the code being measured ...
long estimatedTime = System.nanoTime() - startTime;
Returns:
The current value of the system timer, in nanoseconds.
Since:
1.5

Related

Java parallelStream() with reduce() not improving performance

As a test of Java 8's new implementation of streams and automatic parallelization, I ran the following simple test:
ArrayList<Integer> nums = new ArrayList<>();
for (int i=1; i<49999999; i++) nums.add(i);
int sum=0;
double begin, end;
begin = System.nanoTime();
for (Integer i : nums) sum += i;
end = System.nanoTime();
System.out.println( "1 core: " + (end-begin) );
begin = System.nanoTime();
sum = nums.parallelStream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println( "8 cores: " + (end-begin) );
I thought summing up a series of integers would be able to take great advantage of all 8 cores, but the output looks like this:
1 core: 1.70552398E8
8 cores: 9.938507635E9
I am aware that nanoTime() has issues in multicore systems, but I doubt that's the problem here since I'm off by an order of magnitude.
Is the operation I'm performing so simple that the overhead required for reduce() is overcoming the advantage of multiple cores?
Your stream example has 2 unboxings (Integer.sum(int,int)) and one boxing (the resulting int has to be converted back to an Integer) for every number whereas the for loop has only one unboxing. So the two are not comparable.
When you plan to do calculations with Integers it's best to use an IntStream:
nums.stream().mapToInt(i -> i).sum();
That would give you a performance similar to that of the for loop. A parallel stream is still slower on my machine.
The fastest alternative would be this btw:
IntStream.rangeClosed(0, 49999999).sum();
An order of magnitude faster, and without the overhead of building a list first. It's only an alternative for this special use case, of course. But it demonstrates that it pays off to rethink an existing approach instead of merely "adding a stream".
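If you want the parallel counterpart of that range-based version, a minimal variation (my own sketch, not from the original answer, with java.util.stream.IntStream imported) is just to add parallel(). Note that the true sum of this range overflows int, so the result is only useful for timing comparisons, exactly like the other snippets in this thread:
long begin = System.nanoTime();
// Hypothetical parallel variant of the range sum above; the sum overflows int by design of the example.
int parallelSum = IntStream.rangeClosed(0, 49999999).parallel().sum();
long end = System.nanoTime();
System.out.println("parallel range: " + (end - begin) + " ns, sum=" + parallelSum);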
To properly compare this at all, you need to use similar overheads for both operations.
ArrayList<Integer> nums = new ArrayList<>();
for (int i = 1; i < 49999999; i++)
    nums.add(i);
int sum = 0;
long begin, end;
begin = System.nanoTime();
sum = nums.stream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("1 core: " + (end - begin));
begin = System.nanoTime();
sum = nums.parallelStream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("8 cores: " + (end - begin));
This lands me
1 core: 769026020
8 cores: 538805164
which is in fact quicker for parallelStream(). (Note: I only have 4 cores, but parallelStream() does not always use all of your cores anyway.)
Another thing is boxing and unboxing. There is boxing for nums.add(i), and unboxing for everything going into Integer::sum which takes two ints. I converted this test to an array to remove that:
int[] nums = new int[49999999];
System.err.println("adding numbers");
for (int i = 1; i < 49999999; i++)
    nums[i - 1] = i;
int sum = 0;
System.err.println("begin");
long begin, end;
begin = System.nanoTime();
sum = Arrays.stream(nums).reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("1 core: " + (end - begin));
begin = System.nanoTime();
sum = Arrays.stream(nums).parallel().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("8 cores: " + (end - begin));
And that gives an unexpected timing:
1 core: 68050642
8 cores: 154591290
The linear reduce with plain ints is much faster (1-2 orders of magnitude), but the parallel reduce only drops to about a quarter of its previous time and ends up being the slower of the two. I'm not sure why that is, but it is certainly interesting!
Did some profiling, and it turns out that the fork() method for doing parallel streams is very expensive because of the use of ThreadLocalRandom, which calls upon network interfaces for its seed! This is very slow and is the only reason why parallelStream() is slower than stream()!
Some of my VisualVM data: (ignore the await() time, that's the method I used so I could track the program)
For first example: https://www.dropbox.com/s/z7qf2es0lxs6fvu/streams1.nps?dl=0
For second example: https://www.dropbox.com/s/f3ydl4basv7mln5/streams2.nps?dl=0
TL;DR: In your Integer case it looks like parallel wins, but there is some overhead for the int case that makes parallel slower.
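A way to see that this is per-element overhead rather than a hard limit of parallel streams is to make each element's work heavier, so the fork/join bookkeeping is amortised. The following is my own illustrative sketch (the class name and the artificial work loop are made up, not part of the original benchmarks); on most multi-core machines the parallel version should win here:
import java.util.stream.IntStream;

public class HeavyWorkDemo {
    // Deliberately expensive per-element work so that splitting across cores pays off.
    static double slowOp(int i) {
        double x = i;
        for (int k = 0; k < 200; k++) x = Math.sqrt(x + k);
        return x;
    }

    public static void main(String[] args) {
        long t1 = System.nanoTime();
        double sequential = IntStream.range(0, 1_000_000).mapToDouble(HeavyWorkDemo::slowOp).sum();
        long t2 = System.nanoTime();
        double parallel = IntStream.range(0, 1_000_000).parallel().mapToDouble(HeavyWorkDemo::slowOp).sum();
        long t3 = System.nanoTime();
        System.out.println("sequential: " + (t2 - t1) / 1_000_000 + " ms (sum=" + sequential + ")");
        System.out.println("parallel:   " + (t3 - t2) / 1_000_000 + " ms (sum=" + parallel + ")");
    }
}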

Is the CPU cache ineffective in C programming?

This is the C program, under Linux/GNU:
#include <stdio.h>
#include <sys/time.h>
#define Max 1024*1024

int main()
{
    struct timeval start, end;
    long int dis;
    int i;
    int m = 0;
    int a[Max];

    gettimeofday(&start, NULL);
    for (i = 0; i < Max; i += 1) {
        a[Max] *= 3;
    }
    gettimeofday(&end, NULL);
    dis = end.tv_usec - start.tv_usec;
    printf("time1: %ld\n", dis);

    gettimeofday(&start, NULL);
    for (i = 0; i < Max; i += 16) {
        a[Max] *= 3;
    }
    gettimeofday(&end, NULL);
    dis = end.tv_usec - start.tv_usec;
    printf("time2: %ld\n", dis);

    return 0;
}
the output:
time1: 7074
time2: 234
that's a big difference
this Java program:
public class Cache1 {
    public static void main(String[] args) {
        int a[] = new int[1024 * 1024 * 64];

        long time1 = System.currentTimeMillis();
        for (int i = 0; i < a.length; i++) {
            a[i] *= 3;
        }
        long time2 = System.currentTimeMillis();
        System.out.println(time2 - time1);

        time1 = System.currentTimeMillis();
        for (int i = 0; i < a.length; i += 16) {
            a[i] *= 3;
        }
        time2 = System.currentTimeMillis();
        System.out.println(time2 - time1);
    }
}
the output:
92
82
it's nearly the same.
With the CPU cache in mind, why is there such a big difference between the two? Is the CPU cache ineffective in C programming?
I hope you realize that the units of time in those tests differ by 10^3 (microseconds from gettimeofday, milliseconds from currentTimeMillis). The C code is an order of magnitude faster than the Java code.
In C code there should be a[i] instead of a[Max].
As for cache: since you access only one memory location in your C code (which triggers undefined behavior) your C test is completely invalid.
And even if it were correct, your method is flawed. It is quite possible that the multiplication operations, and even the whole loops, were skipped completely by the C compiler, since nothing depends on their outcome.
The result where first run takes long, and the second takes less time, is expected. Data has to be loaded to cache anyway, and that takes time. Once it is loaded, operations on that data take less time.
Java may either not use cache at all (not likely) or preload the whole array to cache even before the loops are executed. That would explain equal execution times.
You have three cache levels; these are typically:
L1: 32 KB (data), 4 clock cycles
L2: 256 KB, 10-11 clock cycles
L3: 3-24 MB, 40-75 clock cycles
Anything larger than this will not fit into the cache, and if you just scroll through memory it will be as if the caches were not there.
I suggest you write a test which empirically works out the CPU cache sizes as a good exercise to help you understand this. BTW You don't need to use *= to exercise the cache as this exercises the ALU. Perhaps there is a simpler operation you can use ;)
In the case of your Java code, most likely it is not compiled yet, so you are seeing the speed of the interpreter, not the memory accesses.
I suggest you run the test repeatedly on smaller memory sizes for at least 2 seconds and take the average.
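To make that suggestion concrete, here is a rough sketch (my own, written in Java to stay comparable with the second program; the sizes, stride and work budget are arbitrary) of an empirical cache-size probe. The ns-per-access figure should jump roughly where the working set stops fitting in each cache level:
public class CacheProbe {
    public static void main(String[] args) {
        // Walk arrays of increasing size with a cache-line stride (16 ints = 64 bytes).
        // Once the working set no longer fits in a cache level, ns/access jumps.
        for (int kb = 16; kb <= 64 * 1024; kb *= 2) {        // 16 KB .. 64 MB (may need -Xmx)
            int len = kb * 1024 / 4;                         // ints in the array
            int[] a = new int[len];
            long accesses = 0;
            long start = System.nanoTime();
            // Do a roughly fixed amount of work regardless of the array size.
            for (long visited = 0; visited < 64L * 1024 * 1024; visited += len) {
                for (int i = 0; i < len; i += 16) {
                    a[i]++;
                    accesses++;
                }
            }
            long ns = System.nanoTime() - start;
            System.out.printf("%6d KB: %.2f ns/access%n", kb, (double) ns / accesses);
        }
    }
}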

A faster implementation for Math.abs(a - b) - Math.abs(c - d)?

I have a Java method that repeatedly evaluates the following expression in a very tight loop with a large number of repetitions:
Math.abs(a - b) - Math.abs(c - d)
a, b, c and d are long values that can span the whole range of their type. They are different in each loop iteration and they do not satisfy any invariant that I know of.
The profiler indicates that a significant portion of the processor time is spent in this method. While I will pursue other avenues of optimization first, I was wondering if there is a smarter way to calculate the aforementioned expression.
Apart from inlining the Math.abs() calls manually for a very slight (if any) performance gain, is there any mathematical trick that I could use to speed-up the evaluation of this expression?
I suspect the profiler isn't giving you a true result, as it is trying to profile (and thus adding overhead to) such a trivial "method". Without the profiler, Math.abs() can be turned into a small number of machine code instructions, and you won't be able to make it faster than that.
I suggest you do a micro-benchmark to confirm this. I would expect loading the data to be an order of magnitude more expensive.
long a = 10, b = 6, c = -2, d = 3;
int runs = 1000 * 1000 * 1000;
long start = System.nanoTime();
for (int i = 0; i < runs; i += 2) {
    long r = Math.abs(i - a) - Math.abs(c - i);
    long r2 = Math.abs(i - b) - Math.abs(d - i);
    if (r + r2 < Integer.MIN_VALUE) throw new AssertionError();
}
long time = System.nanoTime() - start;
System.out.printf("Took an average of %.1f ns per abs-abs. %n", (double) time / runs);
prints
Took an average of 0.9 ns per abs-abs.
I ended up using this little method:
public static long diff(final long a, final long b, final long c, final long d) {
    final long a0 = (a < b) ? (b - a) : (a - b);
    final long a1 = (c < d) ? (d - c) : (c - d);
    return a0 - a1;
}
I experienced a measurable performance increase - about 10-15% for the whole application. I believe this is mostly due to:
The elimination of a method call: Rather than calling Math.abs() twice, I call this method once. Sure, static method calls are not inordinately expensive, but they still have an impact.
The elimination of a couple of negation operations: This may be offset by the slightly increased size of the code, but I'll happily fool myself into believing that it actually made a difference.
EDIT:
It seems that it's actually the other way around. Explicitly inlining the code does not seem to impact the performance in my micro-benchmark. Changing the way the absolute values are calculated does...
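For completeness, here is one branch-free way of computing the absolute values (a standard bit-twiddling idiom; shown only as a sketch of "changing the way the absolute values are calculated", and whether it actually beats the conditional version depends on the JIT and the data):
// Branch-free |x| for longs: (x >> 63) is 0 for non-negative x and -1 for negative x,
// so (x ^ mask) - mask negates x exactly when it is negative.
// Like Math.abs(), this still misbehaves for Long.MIN_VALUE, and like the original
// expression it assumes a - b and c - d do not overflow.
public static long diffBranchless(long a, long b, long c, long d) {
    long x = a - b;
    long y = c - d;
    long mx = x >> 63;
    long my = y >> 63;
    return ((x ^ mx) - mx) - ((y ^ my) - my);
}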
You can always try to unroll the functions and hand optimize, if you don't get more cache misses it might be faster.
If I got the unrolling right it could be something like this:
if (a < b) {
    if (c < d) {
        r = b - a - d + c;
    } else {
        r = b - a + d - c;
    }
} else {
    if (c < d) {
        r = a - b - d + c;
    } else {
        r = a - b + d - c;
    }
}
Are you sure it's the method itself that causes the problem? Maybe there is an enormous number of invocations of this method and you just see the aggregated result (TIME_OF_METHOD_EXECUTION x NUMBER_OF_INVOCATIONS) in your profiler?

Enhanced for loop performance worse than traditional indexed lookup?

I just came across this seemingly innocuous comment, benchmarking ArrayList vs a raw String array. It's from a couple years ago, but the OP writes
I did notice that using for String s: stringsList was about 50% slower than using an old-style for-loop to access the list. Go figure...
Nobody commented on it in the original post, and the test seemed a little dubious (too short to be accurate), but I nearly fell out of my chair when I read it. I've never benchmarked an enhanced loop against a "traditional" one, but I'm currently working on a project that does hundreds of millions of iterations over ArrayList instances using enhanced loops so this is a concern to me.
I'm going to do some benchmarking and post my findings here, but this is obviously a big concern to me. I could find precious little info online about relative performance, except for a couple offhand mentions that enhanced loops for ArrayLists do run a lot slower under Android.
Has anybody experienced this? Does such a performance gap still exist? I'll post my findings here, but was very surprised to read it. I suspect that if this performance gap did exist, it has been fixed in more modern VM's, but I guess now I'll have to do some testing and confirm.
Update: I made some changes to my code, but was already suspecting what others here have pointed out: sure, the enhanced for loop is slower, but outside of very trivial tight loops the cost should be a minuscule fraction of the cost of the loop's logic. In my case, even though I'm iterating over very large lists of strings using enhanced loops, my logic inside the loop is complex enough that I couldn't even measure a difference after switching to index-based loops.
TL;DR: enhanced loops are indeed slower than a traditional index-based loop over an arraylist; but for most applications the difference should be negligible.
The problem you have is that using an Iterator will be slower than using a direct lookup. On my machine the difference is about 0.13 ns per iteration. Using an array instead saves about 0.15 ns per iteration. This should be trivial in 99% of situations.
public static void main(String... args) {
    int testLength = 100 * 1000 * 1000;
    String[] stringArray = new String[testLength];
    Arrays.fill(stringArray, "a");
    List<String> stringList = new ArrayList<String>(Arrays.asList(stringArray));
    {
        long start = System.nanoTime();
        long total = 0;
        for (String str : stringArray) {
            total += str.length();
        }
        System.out.printf("The for each Array loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
    {
        long start = System.nanoTime();
        long total = 0;
        for (int i = 0, stringListSize = stringList.size(); i < stringListSize; i++) {
            String str = stringList.get(i);
            total += str.length();
        }
        System.out.printf("The for/get List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
    {
        long start = System.nanoTime();
        long total = 0;
        for (String str : stringList) {
            total += str.length();
        }
        System.out.printf("The for each List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
}
When run with one billion entries, this prints (using Java 6 update 26):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
When run with one billion entries, this prints (using OpenJDK 7):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
i.e. exactly the same. ;)
Every claim that X is slower than Y on a JVM which does not address all the issues presented in this article and its second part spreads fear and lies about the performance of a typical JVM. This applies to the comment referred to by the original question as well as to GravityBringer's answer. I am sorry to be so rude, but unless you use appropriate micro-benchmarking technology, your benchmarks produce really badly skewed random numbers.
Tell me if you're interested in more explanations, although it is all in the articles I referred to.
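For anyone who wants to redo this comparison with a proper harness, here is a rough JMH-style sketch (assuming the JMH library is available and a reasonably modern JDK; the class and method names are mine):
import java.util.ArrayList;
import java.util.List;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class LoopBench {
    List<String> list;

    @Setup
    public void setup() {
        list = new ArrayList<String>();
        for (int i = 0; i < 1000000; i++) list.add("a");
    }

    @Benchmark
    public long enhancedFor() {
        long total = 0;
        for (String s : list) total += s.length();
        return total;                       // returning the value keeps it from being dead code
    }

    @Benchmark
    public long indexedFor() {
        long total = 0;
        for (int i = 0, n = list.size(); i < n; i++) total += list.get(i).length();
        return total;
    }
}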
GravityBringer's number doesn't seem right, because I know ArrayList.get() is as fast as raw array access after VM optimization.
I ran GravityBringer's test twice on my machine, -server mode
50574847
43872295
30494292
30787885
(2nd round)
33865894
32939945
33362063
33165376
The bottleneck in such tests is actually memory read/write. Judging from the numbers, the entire 2 arrays are in my L2 cache. If we decrease the size to fit L1 cache, or if we increase the size beyond L2 cache, we'll see 10X throughput difference.
The iterator of ArrayList uses a single int counter. Even if the VM doesn't put it in a register (the loop body is too complex), it will at least be in the L1 cache, so reads and writes of it are basically free.
The ultimate answer of course is to test your particular program in your particular environment.
Though it's not helpful to play agnostic whenever a benchmark question is raised.
The situation has gotten worse for ArrayLists. On my computer running Java 6.26, there is a fourfold difference. Interestingly (and perhaps quite logically), there is no difference for raw arrays. I ran the following test:
int testSize = 5000000;
ArrayList<Double> list = new ArrayList<Double>();
Double[] arr = new Double[testSize];

// set up the data - make sure data doesn't have patterns
// or anything compiler could somehow optimize
for (int i = 0; i < testSize; i++) {
    double someNumber = Math.random();
    list.add(someNumber);
    arr[i] = someNumber;
}

// ArrayList foreach
long time = System.nanoTime();
double total1 = 0;
for (Double k : list) {
    total1 += k;
}
System.out.println(System.nanoTime() - time);

// ArrayList get() method
time = System.nanoTime();
double total2 = 0;
for (int i = 0; i < testSize; i++) {
    total2 += list.get(i);
}
System.out.println(System.nanoTime() - time);

// array foreach
time = System.nanoTime();
double total3 = 0;
for (Double k : arr) {
    total3 += k;
}
System.out.println(System.nanoTime() - time);

// array indexing
time = System.nanoTime();
double total4 = 0;
for (int i = 0; i < testSize; i++) {
    total4 += arr[i];
}
System.out.println(System.nanoTime() - time);

// would be strange if different values were produced,
// but no, all these are the same, of course
System.out.println(total1);
System.out.println(total2);
System.out.println(total3);
System.out.println(total4);
The arithmetic in the loops is to prevent the JIT compiler from possibly optimizing away some of the code. The effect of the arithmetic on performance is small, as the runtime is dominated by the ArrayList accesses.
The runtimes are (in nanoseconds):
ArrayList foreach: 248,351,782
ArrayList get(): 60,657,907
array foreach: 27,381,576
array direct indexing: 27,468,091

java short,integer,long performance

I read that the JVM internally stores short, integer and long as 4 bytes. I read it in an article from the year 2000, so I don't know how true it is now.
For the newer JVMs, is there any performance gain in using short over integer/long? And has that part of the implementation changed since 2000?
Thanks
Integer types are stored in different numbers of bytes, depending on the exact type:
byte on 8 bits
short on 16 bits, signed
int on 32 bits, signed
long on 64 bits, signed
See the spec here.
As for performance, it depends on what you're doing with them.
For example, if you're assigning a literal value to a byte or short, they will be upscaled to int because literal values are considered as ints by default.
byte b = 10; // upscaled to int, because "10" is an int
That's why you can't do:
byte b = 10;
b = b + 1; // Error, right member converted to int, cannot be reassigned to byte without a cast.
So, if you plan to use bytes or shorts to perform some looping, you won't gain anything.
for (byte b=0; b<10; b++)
{ ... }
On the other hand, if you're using arrays of bytes or shorts to store some data, you will obviously benefit from their reduced size.
byte[] bytes = new byte[1000];
int[] ints = new int[1000]; // 4X the size
So, my answer is : it depends :)
long    64 bits    -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
int     32 bits    -2,147,483,648 to 2,147,483,647
short   16 bits    -32,768 to 32,767
byte     8 bits    -128 to 127
Use what you need; I would think shorts are rarely used due to their small range, and they are stored in big-endian format.
Any performance gain would be minimal, but like I said, if your application requires a range greater than that of a short, go with int. The long type may be larger than you need; but again, it all depends on your application.
You should only use short if you have a concern over space (memory); otherwise use int (in most cases). If you are creating arrays and such, try it out by declaring arrays of type int and short. Short will use half the space of an int. But if you run tests based on speed/performance you will see little to no difference (if you are dealing with arrays); the only thing you save is space.
Also, as a commenter mentioned, a long is 64 bits. You will not be able to store a long's value in 4 bytes (notice the range of long).
It's an implementation detail, but it's still true that for performance reasons, most JVMs will use a full word (or more) for each variable, since CPUs access memory in word units. If the JVM stored the variables in sub-word units and locations, it would actually be slower.
This means that a 32bit JVM will use 4 bytes for short (and even boolean) while a 64bit JVM will use 8 bytes. However, the same is not true for array elements.
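A crude way to see the array-element half of that claim (my own sketch; System.gc() is only a hint, so treat the numbers as approximate) is to compare heap usage after allocating a large short[] and a large int[]:
public class ArrayFootprint {
    static long usedHeap() {
        System.gc();                                       // best effort only
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        int n = 10000000;
        long before = usedHeap();
        short[] shorts = new short[n];                     // ~2 bytes per element
        long afterShorts = usedHeap();
        int[] ints = new int[n];                           // ~4 bytes per element
        long afterInts = usedHeap();
        System.out.println("short[]: ~" + (afterShorts - before) / n + " bytes/element");
        System.out.println("int[]:   ~" + (afterInts - afterShorts) / n + " bytes/element");
        System.out.println(shorts.length + ints.length);   // keep the arrays reachable
    }
}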
There's basically no difference. One has to "confuse" the JITC a bit so that it doesn't recognize that the increment/decrement operations are self-cancelling and that the results aren't used. Do that and the three cases come out about equal. (Actually, short seems to be a tiny bit faster.)
public class ShortTest {
    public static void main(String[] args) {
        // Do the inner method 5 times to see how it changes as the JITC attempts to
        // do further optimizations.
        for (int i = 0; i < 5; i++) {
            calculate(i);
        }
    }

    public static void calculate(int passNum) {
        System.out.println("Pass " + passNum);

        // Broke into two (nested) loop counters so the total number of iterations could
        // be large enough to be seen on the clock. (Though this isn't as important when
        // the JITC over-optimizations are prevented.)
        int M = 100000;
        int N = 100000;

        java.util.Random r = new java.util.Random();
        short x = (short) r.nextInt(1);
        short y1 = (short) (x + 1);
        int y2 = x + 1;
        long y3 = x + 1;

        long time1 = System.currentTimeMillis();
        short s = x;
        for (int j = 0; j < M; j++) {
            for (int i = 0; i < N; i++) {
                s += y1;
                s -= 1;
                if (s > 100) {
                    System.out.println("Shouldn't be here");
                }
            }
        }
        long time2 = System.currentTimeMillis();
        System.out.println("Time elapsed for shorts: " + (time2 - time1) + " (" + time1 + "," + time2 + ")");

        long time3 = System.currentTimeMillis();
        int in = x;
        for (int j = 0; j < M; j++) {
            for (int i = 0; i < N; i++) {
                in += y2;
                in -= 1;
                if (in > 100) {
                    System.out.println("Shouldn't be here");
                }
            }
        }
        long time4 = System.currentTimeMillis();
        System.out.println("Time elapsed for ints: " + (time4 - time3) + " (" + time3 + "," + time4 + ")");

        long time5 = System.currentTimeMillis();
        long l = x;
        for (int j = 0; j < M; j++) {
            for (int i = 0; i < N; i++) {
                l += y3;
                l -= 1;
                if (l > 100) {
                    System.out.println("Shouldn't be here");
                }
            }
        }
        long time6 = System.currentTimeMillis();
        System.out.println("Time elapsed for longs: " + (time6 - time5) + " (" + time5 + "," + time6 + ")");

        System.out.println(s + in + l);
    }
}
Results:
C:\JavaTools>java ShortTest
Pass 0
Time elapsed for shorts: 59119 (1422405830404,1422405889523)
Time elapsed for ints: 45810 (1422405889524,1422405935334)
Time elapsed for longs: 47840 (1422405935335,1422405983175)
0
Pass 1
Time elapsed for shorts: 58258 (1422405983176,1422406041434)
Time elapsed for ints: 45607 (1422406041435,1422406087042)
Time elapsed for longs: 46635 (1422406087043,1422406133678)
0
Pass 2
Time elapsed for shorts: 31822 (1422406133679,1422406165501)
Time elapsed for ints: 39663 (1422406165502,1422406205165)
Time elapsed for longs: 37232 (1422406205165,1422406242397)
0
Pass 3
Time elapsed for shorts: 30392 (1422406242398,1422406272790)
Time elapsed for ints: 37949 (1422406272791,1422406310740)
Time elapsed for longs: 37634 (1422406310741,1422406348375)
0
Pass 4
Time elapsed for shorts: 31303 (1422406348376,1422406379679)
Time elapsed for ints: 36583 (1422406379680,1422406416263)
Time elapsed for longs: 38730 (1422406416264,1422406454994)
0
C:\JavaTools>java -version
java version "1.7.0_65"
Java(TM) SE Runtime Environment (build 1.7.0_65-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
I agree with user2391480, calculations with shorts seem to be way more expensive. Here is an example, where on my machine (Java7 64bit, Intel i7-3770, Windows 7) operations with shorts are around ~50 times slower than integers and longs.
public class ShortTest {
    public static void main(String[] args) {
        calculate();
        calculate();
    }

    public static void calculate() {
        int N = 100000000;

        long time1 = System.currentTimeMillis();
        short s = 0;
        for (int i = 0; i < N; i++) {
            s += 1;
            s -= 1;
        }
        long time2 = System.currentTimeMillis();
        System.out.println("Time elapsed for shorts: " + (time2 - time1));

        long time3 = System.currentTimeMillis();
        int in = 0;
        for (int i = 0; i < N; i++) {
            in += 1;
            in -= 1;
        }
        long time4 = System.currentTimeMillis();
        System.out.println("Time elapsed for ints: " + (time4 - time3));

        long time5 = System.currentTimeMillis();
        long l = 0;
        for (int i = 0; i < N; i++) {
            l += 1;
            l -= 1;
        }
        long time6 = System.currentTimeMillis();
        System.out.println("Time elapsed for longs: " + (time6 - time5));

        System.out.println(s + in + l);
    }
}
Output:
Time elapsed for shorts: 113
Time elapsed for ints: 2
Time elapsed for longs: 2
0
Time elapsed for shorts: 119
Time elapsed for ints: 2
Time elapsed for longs: 2
0
Note: specifying "1" to be a short (in order to avoid casting every time, as suggested by user Robotnik as a source of delay) does not seem to help, e.g.
short s = 0;
short one = (short) 1;
for (int i = 0; i < N; i++) {
    s += one;
    s -= one;
}
EDIT: modified as per request of user Hot Licks in the comment, in order to invoke the calculate() method more than once outside the main method.
Calculations with a short type are extremely expensive.
Take the following useless loop for example:
short t = 0;
//int t = 0;
//long t = 0;
for (many many times...) {
    t += 1;
    t -= 1;
}
If it is a short, it will take literally 1000s of times longer than if it's an int or a long.
Checked on 64-bit JVMs versions 6/7 on Linux
