I read that the JVM stores short, integer and long internally as 4 bytes. I read it in an article from the year 2000, so I don't know how true it still is.
For newer JVMs, is there any performance gain in using short over integer/long? And has that part of the implementation changed since 2000?
Thanks
Integer types are stored in different numbers of bytes, depending on the exact type:
byte: 8 bits, signed
short: 16 bits, signed
int: 32 bits, signed
long: 64 bits, signed
See the spec here.
As for performance, it depends on what you're doing with them.
For example, if you assign a literal value to a byte or short, the literal is actually an int; the compiler only allows the assignment because the constant fits in the target type.
byte b = 10; // OK: "10" is an int constant, but it fits in a byte
That's why you can't do:
byte b = 10;
b = b + 1; // error: the right side is promoted to int and cannot be assigned back to byte without a cast
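To make the promotion rules concrete, here is a small sketch (my own illustration) of which assignments compile; note that the compound-assignment operators insert the narrowing cast for you:

```java
public class BytePromotion {
    public static void main(String[] args) {
        byte b = 10;            // OK: the constant 10 fits in a byte, so the
                                // compiler narrows the int literal implicitly
        // b = b + 1;           // does not compile: b + 1 is promoted to int
        b = (byte) (b + 1);     // OK with an explicit cast
        b += 1;                 // OK: compound assignment inserts the cast for you
        System.out.println(b);  // prints 12
    }
}
```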
So, if you plan to use bytes or shorts to perform some looping, you won't gain anything.
for (byte b=0; b<10; b++)
{ ... }
On the other hand, if you're using arrays of bytes or shorts to store some data, you will obviously benefit from their reduced size.
byte[] bytes = new byte[1000];
int[] ints = new int[1000]; // 4X the size
So, my answer is: it depends :)
long: 64 bits, -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
int: 32 bits, -2,147,483,648 to 2,147,483,647
short: 16 bits, -32,768 to 32,767
byte: 8 bits, -128 to 127
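Those bounds are also available as constants on the wrapper classes, if you'd rather not memorize the table; a quick sketch:

```java
public class Ranges {
    public static void main(String[] args) {
        System.out.println(Byte.MIN_VALUE + " .. " + Byte.MAX_VALUE);       // -128 .. 127
        System.out.println(Short.MIN_VALUE + " .. " + Short.MAX_VALUE);     // -32768 .. 32767
        System.out.println(Integer.MIN_VALUE + " .. " + Integer.MAX_VALUE);
        System.out.println(Long.MIN_VALUE + " .. " + Long.MAX_VALUE);
    }
}
```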
Use what you need. I would think shorts are rarely used due to their small range; note also that Java stores them in big-endian format.
Any performance gain would be minimal, but, as I said, if your application requires a range greater than that of a short, go with int. The long type may be larger than you need; again, it all depends on your application.
You should only use short if you have a concern over space (memory); otherwise use int (in most cases). If you are creating arrays, try it out by declaring arrays of type int and short: the short array will use half the space of the int array. But if you run tests based on speed/performance you will see little to no difference (when dealing with arrays); the only thing you save is space.
Also, since a commenter mentioned long: a long is 64 bits, so you will not be able to store it in 4 bytes (notice the range of long).
It's an implementation detail, but it's still true that for performance reasons, most JVMs will use a full word (or more) for each variable, since CPUs access memory in word units. If the JVM stored the variables in sub-word units and locations, it would actually be slower.
This means that a 32-bit JVM will use 4 bytes for a short (and even a boolean), while a 64-bit JVM will use 8 bytes. However, the same is not true for array elements.
There's basically no difference. One has to "confuse" the JITC a bit so that it doesn't recognize that the increment/decrement operations are self-cancelling and that the results aren't used. Do that and the three cases come out about equal. (Actually, short seems to be a tiny bit faster.)
public class ShortTest {
    public static void main(String[] args) {
        // Do the inner method 5 times to see how it changes as the JITC attempts to
        // do further optimizations.
        for (int i = 0; i < 5; i++) {
            calculate(i);
        }
    }

    public static void calculate(int passNum) {
        System.out.println("Pass " + passNum);

        // Broke into two (nested) loop counters so the total number of iterations could
        // be large enough to be seen on the clock. (Though this isn't as important when
        // the JITC over-optimizations are prevented.)
        int M = 100000;
        int N = 100000;

        java.util.Random r = new java.util.Random();
        short x = (short) r.nextInt(1);
        short y1 = (short) (x + 1);
        int y2 = x + 1;
        long y3 = x + 1;

        long time1 = System.currentTimeMillis();
        short s = x;
        for (int j = 0; j < M; j++) {
            for (int i = 0; i < N; i++) {
                s += y1;
                s -= 1;
                if (s > 100) {
                    System.out.println("Shouldn't be here");
                }
            }
        }
        long time2 = System.currentTimeMillis();
        System.out.println("Time elapsed for shorts: " + (time2 - time1) + " (" + time1 + "," + time2 + ")");

        long time3 = System.currentTimeMillis();
        int in = x;
        for (int j = 0; j < M; j++) {
            for (int i = 0; i < N; i++) {
                in += y2;
                in -= 1;
                if (in > 100) {
                    System.out.println("Shouldn't be here");
                }
            }
        }
        long time4 = System.currentTimeMillis();
        System.out.println("Time elapsed for ints: " + (time4 - time3) + " (" + time3 + "," + time4 + ")");

        long time5 = System.currentTimeMillis();
        long l = x;
        for (int j = 0; j < M; j++) {
            for (int i = 0; i < N; i++) {
                l += y3;
                l -= 1;
                if (l > 100) {
                    System.out.println("Shouldn't be here");
                }
            }
        }
        long time6 = System.currentTimeMillis();
        System.out.println("Time elapsed for longs: " + (time6 - time5) + " (" + time5 + "," + time6 + ")");

        System.out.println(s + in + l);
    }
}
Results:
C:\JavaTools>java ShortTest
Pass 0
Time elapsed for shorts: 59119 (1422405830404,1422405889523)
Time elapsed for ints: 45810 (1422405889524,1422405935334)
Time elapsed for longs: 47840 (1422405935335,1422405983175)
0
Pass 1
Time elapsed for shorts: 58258 (1422405983176,1422406041434)
Time elapsed for ints: 45607 (1422406041435,1422406087042)
Time elapsed for longs: 46635 (1422406087043,1422406133678)
0
Pass 2
Time elapsed for shorts: 31822 (1422406133679,1422406165501)
Time elapsed for ints: 39663 (1422406165502,1422406205165)
Time elapsed for longs: 37232 (1422406205165,1422406242397)
0
Pass 3
Time elapsed for shorts: 30392 (1422406242398,1422406272790)
Time elapsed for ints: 37949 (1422406272791,1422406310740)
Time elapsed for longs: 37634 (1422406310741,1422406348375)
0
Pass 4
Time elapsed for shorts: 31303 (1422406348376,1422406379679)
Time elapsed for ints: 36583 (1422406379680,1422406416263)
Time elapsed for longs: 38730 (1422406416264,1422406454994)
0
C:\JavaTools>java -version
java version "1.7.0_65"
Java(TM) SE Runtime Environment (build 1.7.0_65-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
I agree with user2391480; calculations with shorts seem to be way more expensive. Here is an example, where on my machine (Java 7 64-bit, Intel i7-3770, Windows 7) operations with shorts are around 50 times slower than with integers and longs.
public class ShortTest {
    public static void main(String[] args) {
        calculate();
        calculate();
    }

    public static void calculate() {
        int N = 100000000;

        long time1 = System.currentTimeMillis();
        short s = 0;
        for (int i = 0; i < N; i++) {
            s += 1;
            s -= 1;
        }
        long time2 = System.currentTimeMillis();
        System.out.println("Time elapsed for shorts: " + (time2 - time1));

        long time3 = System.currentTimeMillis();
        int in = 0;
        for (int i = 0; i < N; i++) {
            in += 1;
            in -= 1;
        }
        long time4 = System.currentTimeMillis();
        System.out.println("Time elapsed for ints: " + (time4 - time3));

        long time5 = System.currentTimeMillis();
        long l = 0;
        for (int i = 0; i < N; i++) {
            l += 1;
            l -= 1;
        }
        long time6 = System.currentTimeMillis();
        System.out.println("Time elapsed for longs: " + (time6 - time5));

        System.out.println(s + in + l);
    }
}
Output:
Time elapsed for shorts: 113
Time elapsed for ints: 2
Time elapsed for longs: 2
0
Time elapsed for shorts: 119
Time elapsed for ints: 2
Time elapsed for longs: 2
0
Note: declaring the "1" as a short (in order to avoid a cast every time, as suggested by user Robotnik as a source of delay) does not seem to help, e.g.
short s = 0;
short one = (short) 1;
for (int i = 0; i < N; i++) {
    s += one;
    s -= one;
}
EDIT: modified as per request of user Hot Licks in the comment, in order to invoke the calculate() method more than once outside the main method.
Calculations with a short type are extremely expensive.
Take the following useless loop for example:
short t = 0;
//int t = 0;
//long t = 0;
for (many many times...)
{
    t += 1;
    t -= 1;
}
If t is a short, it will take literally thousands of times longer than if it's an int or a long.
Checked on 64-bit JVMs, versions 6/7, on Linux.
Related
I recently decided to play around with Java's new incubated vector API, to see how fast it can get. I implemented two fairly simple methods, one for parsing an int and one for finding the index of a character in a string. In both cases, my vectorized methods were incredibly slow compared to their scalar equivalents.
Here's my code:
import java.nio.charset.StandardCharsets;

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;

public class SIMDParse {
    private static IntVector mul = IntVector.fromArray(
            IntVector.SPECIES_512,
            new int[] {0, 0, 0, 0, 0, 0, 1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1},
            0
    );
    private static byte zeroChar = (byte) '0';
    private static int width = IntVector.SPECIES_512.length();
    private static byte[] filler;

    static {
        filler = new byte[16];
        for (int i = 0; i < 16; i++) {
            filler[i] = zeroChar;
        }
    }

    public static int parseInt(String str) {
        boolean negative = str.charAt(0) == '-';
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (negative) {
            bytes[0] = zeroChar;
        }
        bytes = ensureSize(bytes, width);
        ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_128, bytes, 0);
        vec = vec.sub(zeroChar);
        IntVector ints = (IntVector) vec.castShape(IntVector.SPECIES_512, 0);
        ints = ints.mul(mul);
        return ints.reduceLanes(VectorOperators.ADD) * (negative ? -1 : 1);
    }

    public static byte[] ensureSize(byte[] arr, int per) {
        int mod = arr.length % per;
        if (mod == 0) {
            return arr;
        }
        int length = arr.length - mod + per;
        byte[] newArr = new byte[length];
        System.arraycopy(arr, 0, newArr, per - mod, arr.length);
        System.arraycopy(filler, 0, newArr, 0, per - mod);
        return newArr;
    }

    public static byte[] ensureSize2(byte[] arr, int per) {
        int mod = arr.length % per;
        if (mod == 0) {
            return arr;
        }
        int length = arr.length - mod + per;
        byte[] newArr = new byte[length];
        System.arraycopy(arr, 0, newArr, 0, arr.length);
        return newArr;
    }

    public static int indexOf(String s, char c) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        int width = ByteVector.SPECIES_MAX.length();
        byte bChar = (byte) c;
        b = ensureSize2(b, width);
        for (int i = 0; i < b.length; i += width) {
            ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_MAX, b, i);
            int pos = vec.compare(VectorOperators.EQ, bChar).firstTrue();
            if (pos != width) {
                return pos + i;
            }
        }
        return -1;
    }
}
I fully expected my int parsing to be slower, since it won't ever be handling more than the vector size can hold (an int can never be more than 10 digits long).
By my benchmarks, parsing 123 as an int 10k times took 3081 microseconds with Integer.parseInt, and 80601 microseconds with my implementation. Searching for 'a' in a very long string ("____".repeat(4000) + "a" + "----".repeat(193)) took 7709 microseconds, versus String#indexOf's 7.
Why is it so unbelievably slow? I thought the entire point of SIMD is that it's faster than the scalar equivalents for tasks like these.
You picked something SIMD is not great at (string->int), and something that JVMs are very good at optimizing out of loops. And you made an implementation with a bunch of extra copying work if the inputs aren't exact multiples of the vector width.
I'm assuming your times are totals (for 10k repeats each), not a per-call average.
7 us is impossibly fast for that.
"____".repeat(4000) is 16k bytes before the 'a', which I assume is what you're searching for. Even a well-tuned / unrolled memchr (aka indexOf) running at 2x 32-byte vectors per clock cycle, on a 4GHz CPU, would take 625 us for 10k reps. (16000B / (64B/c) * 10000 reps / 4000 MHz). And yes, I'd expect a JVM to either call the native memchr or use something equally efficient for a commonly-used core library function like String#indexOf. For example, glibc's avx2 memchr is pretty well-tuned with loop unrolling; if you're on Linux, your JVM might be calling it.
Built-in String indexOf is also something the JIT "knows about". It's apparently able to hoist it out of loops when it can see that you're using the same string repeatedly as input. (But then what's it doing for the rest of those 7 us? I guess doing a not-quite-so-great memchr and then doing an empty 10k iteration loop at 1/clock could take about 7 microseconds, especially if your CPU isn't as fast as 4GHz.)
See Idiomatic way of performance evaluation? - if doubling the repeat-count to 20k doesn't double the time, your benchmark is broken and not measuring what you think it does.
Your manual SIMD indexOf is very unlikely to get optimized out of a loop. It makes a copy of the whole array every time, if the size isn't an exact multiple of the vector width!! (In ensureSize2). The normal technique is to fall back to scalar for the last size % width elements, which is obviously much better for large arrays. Or even better, do an unaligned load that ends at the end of the array (if the total size is >= vector width) for something where overlap with previous work is not a problem.
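A rough sketch of that no-copy structure (my own illustration, using a plain chunked scan as a scalar stand-in for the vector compare so it runs without the incubator module; WIDTH is a hypothetical species length):

```java
public class TailFallback {
    // Hypothetical chunk width standing in for the vector species length.
    static final int WIDTH = 16;

    static int indexOf(byte[] b, byte target) {
        int bound = b.length - (b.length % WIDTH);  // like SPECIES.loopBound(b.length)
        for (int i = 0; i < bound; i += WIDTH) {
            // In the real version this whole chunk would be one vector
            // compare + firstTrue(); here a scalar loop keeps it runnable.
            for (int j = i; j < i + WIDTH; j++) {
                if (b[j] == target) return j;
            }
        }
        // Scalar tail for the last length % WIDTH elements: no array copy needed.
        for (int i = bound; i < b.length; i++) {
            if (b[i] == target) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[35];  // deliberately not a multiple of 16
        data[33] = 'a';
        System.out.println(indexOf(data, (byte) 'a'));  // prints 33
    }
}
```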
A decent memchr on modern x86 (using an algorithm like your indexOf, without unrolling) should go at about 1 vector (16/32/64 bytes) per maybe 1.5 clock cycles, with data hot in L1d cache, without loop unrolling or anything. (Checking both the vector compare and the pointer bound as possible loop exit conditions takes extra asm instructions vs. a simple strlen, but see this answer for some microbenchmarks of a simple hand-written strlen that assumes aligned buffers.) Probably your indexOf loop bottlenecks on front-end throughput on a CPU like Skylake, with its pipeline width of 4 uops/clock.
So let's guess that your implementation takes 1.5 cycles per 16-byte vector, if perhaps you're on a CPU without AVX2? You didn't say.
16kB / 16B = 1000 vectors. At 1 vector per 1.5 clocks, that's 1500 cycles. On a 3GHz machine, 1500 cycles takes 500 ns = 0.5 us per call, or 5000 us per 10k reps. But since 16194 bytes isn't a multiple of 16, you're also copying the whole thing every call, so that costs some more time, and could plausibly account for your 7709 us total time.
What SIMD is good for
"for tasks like these."
No, "horizontal" stuff like ints.reduceLanes is something SIMD is generally slow at. And even with something like How to implement atoi using SIMD? using x86 pmaddwd to multiply and add pairs horizontally, it's still a lot of work.
Note that to make the elements wide enough to multiply by place-values without overflow, you have to unpack, which costs some shuffling. ints.reduceLanes takes about log2(elements) shuffle/add steps, and if you're starting with 512-bit AVX-512 vectors of int, the first 2 of those shuffles are lane-crossing, 3 cycle latency (https://agner.org/optimize/). (Or if your machine doesn't even have AVX2, then a 512-bit integer vector is actually 4x 128-bit vectors. And you had to do separate work to unpack each part. But at least the reduction will be cheap, just vertical adds until you get down to a single 128-bit vector.)
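For reference, here is the same subtract-'0' / multiply-by-place-value / reduce idea spelled out in scalar Java (my own illustration): this is what the vectorized parse computes per lane, just without the unpacking and shuffle costs discussed above.

```java
import java.nio.charset.StandardCharsets;

public class PlaceValueParse {
    // Scalar spelling of the vector algorithm: subtract '0' from each byte,
    // multiply each digit by its place value, then sum ("reduce") the lanes.
    // Handles non-negative inputs of up to 9 digits (no overflow handling).
    static int parse(String s) {
        byte[] digits = s.getBytes(StandardCharsets.UTF_8);
        int[] place = new int[digits.length];
        for (int i = 0, p = 1; i < digits.length; i++, p *= 10) {
            place[digits.length - 1 - i] = p;   // rightmost digit gets place 1
        }
        int sum = 0;
        for (int i = 0; i < digits.length; i++) {
            sum += (digits[i] - '0') * place[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(parse("123"));  // prints 123
    }
}
```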
Hmm. I found this post because I've hit something strange with the Vector performance for something it should ostensibly be ideal for - multiplying two double arrays.
static private void doVector(int iteration, double[] input1, double[] input2, double[] output) {
    Instant start = Instant.now();
    for (int i = 0; i < SPECIES.loopBound(ARRAY_LENGTH); i += SPECIES.length()) {
        DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
        DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
        va.mul(vb);  // note: result discarded (vectors are immutable); the product is recomputed below
        System.arraycopy(va.mul(vb).toArray(), 0, output, i, SPECIES.length());
    }
    Instant finish = Instant.now();
    System.out.println("vector duration " + iteration + ": " + Duration.between(start, finish).getNano());
}
The species length comes out at 4 on my machine (CPU is Intel i7-7700HQ at 2.8 GHz).
On my first attempt the execution was taking more than 15 milliseconds (compared with 0 for the scalar equivalent), even with a tiny array length (8 elements). On a hunch I added the iteration number to see whether something had to warm up - and indeed, the first iteration still ALWAYS takes ages (44 ms for 65536 elements). While most of the other iterations report zero time, a few take around 15 ms, randomly distributed (i.e. not always the same iteration index on each run). I sort of expect that (because I'm measuring wall-clock time and other stuff will be going on).
However, overall for an array size of 65536 elements, and 32 iterations, the total duration for the vector approach is 2-3 times longer than that for the scalar one.
As a test of Java 8's new implementation of streams and automatic parallelization, I ran the following simple test:
ArrayList<Integer> nums = new ArrayList<>();
for (int i = 1; i < 49999999; i++) nums.add(i);

int sum = 0;
double begin, end;

begin = System.nanoTime();
for (Integer i : nums) sum += i;
end = System.nanoTime();
System.out.println("1 core: " + (end - begin));

begin = System.nanoTime();
sum = nums.parallelStream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("8 cores: " + (end - begin));
I thought summing up a series of integers would be able to take great advantage of all 8 cores, but the output looks like this:
1 core: 1.70552398E8
8 cores: 9.938507635E9
I am aware that nanoTime() has issues in multicore systems, but I doubt that's the problem here since I'm off by an order of magnitude.
Is the operation I'm performing so simple that the overhead required for reduce() is overcoming the advantage of multiple cores?
Your stream example has 2 unboxings (Integer.sum(int,int)) and one boxing (the resulting int has to be converted back to an Integer) for every number whereas the for loop has only one unboxing. So the two are not comparable.
When you plan to do calculations with Integers it's best to use an IntStream:
nums.stream().mapToInt(i -> i).sum();
That would give you a performance similar to that of the for loop. A parallel stream is still slower on my machine.
The fastest alternative would be this btw:
IntStream.rangeClosed(0, 49999999).sum();
An order of magnitude faster, and without the overhead of building a list first. It's only an alternative for this special use case, of course. But it demonstrates that it pays off to rethink an existing approach instead of merely "adding a stream".
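A minimal sketch contrasting the three shapes at a small size (names are mine, not from the answer; sums only, no timing, so all three provably agree):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class SumVariants {
    public static void main(String[] args) {
        int n = 1000;
        List<Integer> nums = new ArrayList<>();
        for (int i = 1; i <= n; i++) nums.add(i);

        int boxed = nums.stream().reduce(0, Integer::sum);    // unbox, add, re-box per element
        int unboxed = nums.stream().mapToInt(i -> i).sum();   // primitive IntStream pipeline
        int range = IntStream.rangeClosed(1, n).sum();        // no list built at all

        System.out.println(boxed + " " + unboxed + " " + range);  // 500500 500500 500500
    }
}
```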
To properly compare this at all, you need to use similar overheads for both operations.
ArrayList<Integer> nums = new ArrayList<>();
for (int i = 1; i < 49999999; i++)
    nums.add(i);

int sum = 0;
long begin, end;

begin = System.nanoTime();
sum = nums.stream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("1 core: " + (end - begin));

begin = System.nanoTime();
sum = nums.parallelStream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("8 cores: " + (end - begin));
This lands me
1 core: 769026020
8 cores: 538805164
which is in fact quicker for parallelStream(). (Note: I only have 4 cores, but parallelStream() does not always use all of your cores anyway)
Another thing is boxing and unboxing. There is boxing for nums.add(i), and unboxing for everything going into Integer::sum which takes two ints. I converted this test to an array to remove that:
int[] nums = new int[49999999];
System.err.println("adding numbers");
for (int i = 1; i < 49999999; i++)
    nums[i - 1] = i;

int sum = 0;
System.err.println("begin");
long begin, end;

begin = System.nanoTime();
sum = Arrays.stream(nums).reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("1 core: " + (end - begin));

begin = System.nanoTime();
sum = Arrays.stream(nums).parallel().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("8 cores: " + (end - begin));
And that gives an unexpected timing:
1 core: 68050642
8 cores: 154591290
It is much faster (1-2 orders of magnitude) for the linear reduce with regular ints; the parallel reduce drops to only about 1/4 of its former time, yet still ends up slower than the linear one. I'm not sure why that is, but it is certainly interesting!
Did some profiling; it turns out that the fork() method for doing parallel streams is very expensive because of its use of ThreadLocalRandom, which calls upon network interfaces for its seed! This is very slow and is the only reason why parallelStream() is slower than stream()!
Some of my VisualVM data: (ignore the await() time, that's the method I used so I could track the program)
For first example: https://www.dropbox.com/s/z7qf2es0lxs6fvu/streams1.nps?dl=0
For second example: https://www.dropbox.com/s/f3ydl4basv7mln5/streams2.nps?dl=0
TL;DR: In your Integer case it looks like parallel wins, but there is some overhead for the int case that makes parallel slower.
This is the C program, under Linux/GNU:
#include <stdio.h>
#include <sys/time.h>

#define Max 1024*1024

int main()
{
    struct timeval start, end;
    long int dis;
    int i;
    int m = 0;
    int a[Max];

    gettimeofday(&start, NULL);
    for (i = 0; i < Max; i += 1) {
        a[Max] *= 3;
    }
    gettimeofday(&end, NULL);
    dis = end.tv_usec - start.tv_usec;
    printf("time1: %ld\n", dis);

    gettimeofday(&start, NULL);
    for (i = 0; i < Max; i += 16) {
        a[Max] *= 3;
    }
    gettimeofday(&end, NULL);
    dis = end.tv_usec - start.tv_usec;
    printf("time2: %ld\n", dis);

    return 0;
}
the output:
time1: 7074
time2: 234
That's a big difference.
This is the Java program:
public class Cache1 {
    public static void main(String[] args) {
        int a[] = new int[1024 * 1024 * 64];

        long time1 = System.currentTimeMillis();
        for (int i = 0; i < a.length; i++) {
            a[i] *= 3;
        }
        long time2 = System.currentTimeMillis();
        System.out.println(time2 - time1);

        time1 = System.currentTimeMillis();
        for (int i = 0; i < a.length; i += 16) {
            a[i] *= 3;
        }
        time2 = System.currentTimeMillis();
        System.out.println(time2 - time1);
    }
}
the output:
92
82
It's nearly the same.
Given the CPU cache, why is there so much difference between the two? Is the CPU cache ineffective in C programming?
I hope you realize that the units of time in those tests differ by a factor of 10^3; the C code is an order of magnitude faster than the Java code.
In C code there should be a[i] instead of a[Max].
As for cache: since you access only one memory location in your C code (which triggers undefined behavior) your C test is completely invalid.
And even if it were correct, your method is flawed. It is quite possible that the multiplication operations, and even the whole loops, were skipped completely by the C compiler, since nothing depends on their outcome.
The result where first run takes long, and the second takes less time, is expected. Data has to be loaded to cache anyway, and that takes time. Once it is loaded, operations on that data take less time.
Java may either not use cache at all (not likely) or preload the whole array to cache even before the loops are executed. That would explain equal execution times.
You have three cache sizes, these are typically
L1: 32 KB (data), 4 clock cycles
L2: 256KB, 10-11 clock cycles
L3: 3-24 MB. 40 - 75 clock cycles.
Anything larger than this will not fit into the cache; if you just scroll through memory, it will be as if the caches are not there.
I suggest you write a test which empirically works out the CPU cache sizes as a good exercise to help you understand this. BTW You don't need to use *= to exercise the cache as this exercises the ALU. Perhaps there is a simpler operation you can use ;)
In the case of your Java code, most likely it is not compiled yet, so you are seeing the speed of the interpreter, not the memory accesses.
I suggest you run the test repeatedly on smaller memory sizes for at least 2 seconds and take the average.
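A minimal sketch of such an empirical test might look like the following (my own illustration): hold the number of accesses constant, stride by one 64-byte cache line, and grow the working set; the ns/access figure should step up as each cache level is exceeded. Actual numbers are entirely machine-dependent, and a serious version would repeat each size for a couple of seconds as suggested above.

```java
public class CacheProbe {
    public static void main(String[] args) {
        // Stride of one cache line: 64 bytes = 16 ints. Total accesses are held
        // constant, so only the working-set size changes between runs.
        final int stride = 16;
        for (int size = 4 * 1024; size <= 16 * 1024 * 1024; size *= 4) {
            int[] a = new int[size];
            int steps = 1 << 22;             // constant work per size
            long t0 = System.nanoTime();
            int idx = 0, sink = 0;
            for (int i = 0; i < steps; i++) {
                sink += a[idx];              // sink prevents dead-code elimination
                idx += stride;
                if (idx >= size) idx = 0;
            }
            long t1 = System.nanoTime();
            System.out.println(size * 4 / 1024 + " KB: "
                    + (double) (t1 - t0) / steps + " ns/access (sink=" + sink + ")");
        }
    }
}
```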
Why might this code
long s, e, sum1 = 0, sum2 = 0, TRIALS = 10000000;
for (long i = 0; i < TRIALS; i++) {
    s = System.nanoTime();
    e = System.nanoTime();
    sum1 += e - s;
    s = System.nanoTime();
    e = System.nanoTime();
    sum2 += e - s;
}
System.out.println(sum1 / TRIALS);
System.out.println(sum2 / TRIALS);
produce this result
-60
61
on my machine?
EDIT:
Sam I am's answer points to the nanoTime() documentation, which helps, but now, more precisely: why does the result consistently favor the first sum?
"my machine":
JavaSE-1.7, Eclipse
Win 7 x64, AMD Athlon II X4 635
switching the order inside the loop produces reverse results
for (int i = 0; i < TRIALS; i++) {
    s = System.nanoTime();
    e = System.nanoTime();
    sum2 += e - s;
    s = System.nanoTime();
    e = System.nanoTime();
    sum1 += e - s;
}
61
-61
Checking (e - s) before adding it to sum1 makes sum1 positive.
for (long i = 0; i < TRIALS; i++) {
    s = System.nanoTime();
    e = System.nanoTime();
    temp = e - s;
    if (temp < 0)
        count++;
    sum1 += temp;
    s = System.nanoTime();
    e = System.nanoTime();
    sum2 += e - s;
}
61
61
And as Andrew Alcock points out, sum1 += -s + e produces the expected outcome.
for (long i = 0; i < TRIALS; i++) {
    s = System.nanoTime();
    e = System.nanoTime();
    sum1 += -s + e;
    s = System.nanoTime();
    e = System.nanoTime();
    sum2 += -s + e;
}
61
61
A few other tests: http://pastebin.com/QJ93NZxP
This answer is supposition. If you update your question with some details about your environment, it's likely that someone else can give a more detailed, grounded answer.
The nanoTime() function works by accessing some high-resolution timer with low access latency. On the x86, I believe this is the Time Stamp Counter, which is driven by the basic clock cycle of the machine.
If you're seeing consistent results of +/- 60 ns, then I believe you're simply seeing the basic interval of the timer on your machine.
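One way to observe that interval directly is to spin until nanoTime() reports a new value; a sketch (the step you see depends entirely on your OS and hardware):

```java
public class NanoGranularity {
    public static void main(String[] args) {
        // Spin until nanoTime() returns a new value; the difference is the
        // smallest step the timer can report on this machine.
        long first = System.nanoTime();
        long next;
        do {
            next = System.nanoTime();
        } while (next == first);
        System.out.println("smallest observed step: " + (next - first) + " ns");
    }
}
```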
However, what about the negative numbers? Again, supposition, but if you read the Wikipedia article, you'll see a comment that Intel processors might re-order the instructions.
In conjunction with roundar, we ran a number of tests on this code. In summary, the effect disappeared when:
Running the same code in interpreted mode (-Xint)
Changing the aggregation logic order from sum += e - s to sum += -s + e
Running on some different architectures or different VMs (eg I ran on Java 6 on Mac)
Placing logging statements inspecting s and e
Performing additional arithmetic on s and e
In addition, the effect is not threading:
There are no additional threads spawned
Only local variables are involved
This effect is 100% reproducible in roundar's environment, and always results in precisely the same timings, namely +61 and -61.
The effect is not a timing issue because:
The execution takes place over 10m iterations
This effect is 100% reproducible in roundar's environment
The result is precisely the same timings, namely +61 and -61, on all iterations.
Given the above, I believe we have a bug in the hotspot module of Java VM. The code as written should return positive results, but does not.
Straight from Oracle's documentation:
In short: the frequency of updating the values can cause results to differ.
nanoTime
public static long nanoTime()
Returns the current value of the most precise available system timer, in nanoseconds.
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
The value returned represents nanoseconds since some fixed but arbitrary time
(perhaps in the future, so values may be negative). This method
provides nanosecond precision, but not necessarily nanosecond
accuracy. No guarantees are made about how frequently values change.
Differences in successive calls that span greater than approximately
292 years (2^63 nanoseconds) will not accurately compute elapsed time
due to numerical overflow.
For example, to measure how long some code takes to execute:
long startTime = System.nanoTime();
// ... the code being measured ...
long estimatedTime = System.nanoTime() - startTime;
Returns:
The current value of the system timer, in nanoseconds.
Since:
1.5
I ran a set of performance benchmarks for 10,000,000 elements, and I've discovered that the results vary greatly with each implementation.
Can anybody explain why creating a Range.ByOne results in performance that is better than a simple array of primitives, but converting that same range to a list results in even worse performance than the worst-case scenario?
Create 10,000,000 elements, and print out those that are multiples of 1,000,000. JVM size is always set to the same min and max: -Xms?m -Xmx?m
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LightAndFastRange extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  def millions(): List[Int] = (0 to 10000000).filter(_ % 1000000 == 0).toList

  val results = chrono(millions())
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
It takes 141 milliseconds with a JVM size of 27m
In comparison, converting to List affects performance dramatically:
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LargeLinkedList extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  val results = chrono((0 to 10000000).toList.filter(_ % 1000000 == 0))
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
It takes 8514-10896 ms with 455-460 m
In contrast, this Java implementation uses an array of primitives
import static java.util.concurrent.TimeUnit.*;

public class LargePrimitiveArray {
    public static void main(String[] args) {
        long start = System.nanoTime();
        int[] elements = new int[10000000];
        for (int i = 0; i < 10000000; i++) {
            elements[i] = i;
        }
        for (int i = 0; i < 10000000; i++) {
            if (elements[i] % 1000000 == 0) {
                System.out.println("x: " + elements[i]);
            }
        }
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }
}
It takes 116ms with JVM size of 59m
Java List of Integers
import java.util.List;
import java.util.ArrayList;
import static java.util.concurrent.TimeUnit.*;

public class LargeArrayList {
    public static void main(String[] args) {
        long start = System.nanoTime();
        List<Integer> elements = new ArrayList<Integer>();
        for (int i = 0; i < 10000000; i++) {
            elements.add(i);
        }
        for (Integer x : elements) {
            if (x % 1000000 == 0) {
                System.out.println("x: " + x);
            }
        }
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }
}
It takes 3993 ms with JVM size of 283m
My question is, why is the first example so performant, while the second is so badly affected. I tried creating views, but wasn't successful at reproducing the performance benefits of the range.
All tests running on Mac OS X Snow Leopard,
Java 6u26 64-Bit Server
Scala 2.9.1.final
EDIT:
for completeness, here's the actual implementation using a LinkedList (which is a fairer comparison in terms of space than ArrayList since, as rightly pointed out, Scala's List is linked)
import java.util.List;
import java.util.LinkedList;
import static java.util.concurrent.TimeUnit.*;

public class LargeLinkedList {
    public static void main(String[] args) {
        LargeLinkedList test = new LargeLinkedList();
        long start = System.nanoTime();
        List<Integer> elements = test.createElements();
        test.findElementsToPrint(elements);
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }

    private List<Integer> createElements() {
        List<Integer> elements = new LinkedList<Integer>();
        for (int i = 0; i < 10000000; i++) {
            elements.add(i);
        }
        return elements;
    }

    private void findElementsToPrint(List<Integer> elements) {
        for (Integer x : elements) {
            if (x % 1000000 == 0) {
                System.out.println("x: " + x);
            }
        }
    }
}
Takes 3621-6749 ms with 460-480 MB. That's much more in line with the performance of the second Scala example.
finally, a LargeArrayBuffer
import collection.mutable.ArrayBuffer
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LargeArrayBuffer extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  def millions(): List[Int] = {
    val size = 10000000
    var items = new ArrayBuffer[Int](size)
    (0 to size).foreach(items += _)
    items.filter(_ % 1000000 == 0).toList
  }

  val results = chrono(millions())
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
Taking about 2145 ms and 375 mb
Thanks a lot for the answers.
Oh So Many Things going on here!!!
Let's start with Java int[]. Arrays in Java are the only collection that is not type erased. The run time representation of an int[] is different from the run time representation of Object[], in that it actually uses int directly. Because of that, there's no boxing involved in using it.
In memory terms, you have 40,000,000 consecutive bytes in memory, which are read and written 4 at a time whenever an element is read or written.
In contrast, an ArrayList<Integer> -- as well as pretty much any other generic collection -- is composed of 40,000,000 or 80,000,000 consecutive bytes (on 32- and 64-bit JVMs respectively), PLUS 80,000,000 bytes spread all around memory in groups of 8 bytes. Every read and write of an element has to go through two memory spaces, and the sheer time spent handling all that memory is significant when the actual task you are doing is so fast.
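To make the boxing point concrete, here's a small stand-alone Java sketch (the class name BoxingDemo is made up for illustration) contrasting a primitive int[] with boxed Integer values:

```java
// Sketch: int[] holds primitives inline; Integer[] holds references to boxed objects.
public class BoxingDemo {
    public static void main(String[] args) {
        int[] primitives = new int[3];    // 3 ints stored back-to-back, 4 bytes each
        Integer[] boxed = new Integer[3]; // 3 references, each pointing to a separate object

        primitives[0] = 1000;
        boxed[0] = 1000;                  // autoboxing: allocates an Integer object

        // Two boxed values outside the small-integer cache are distinct objects...
        Integer x = 1000, y = 1000;
        System.out.println(x == y);       // false: reference comparison of two objects
        // ...while comparison against a primitive unboxes the Integer first.
        System.out.println(primitives[0] == boxed[0]); // true: compared as ints
    }
}
```

This is why a generic collection of Integers pays both the reference indirection and the per-object header cost that a plain int[] avoids.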
So, back to Scala, and the second example, where you manipulate a List. Now, Scala's List is much more like Java's LinkedList than the grossly misnamed ArrayList. Each element of a List is held in a cell called Cons, which takes 16 bytes and contains a pointer to the element and a pointer to the rest of the list. So, a List of 10,000,000 elements is composed of 160,000,000 bytes spread all around memory in groups of 16 bytes, plus 80,000,000 bytes spread all around memory in groups of 8 bytes. So what was true for ArrayList is even more so for List.
Finally, Range. A Range is a sequence of integers with a lower and an upper boundary, plus a step. A Range of 10,000,000 elements is 40 bytes: three ints (not generic) for the lower bound, upper bound and step, plus a few pre-computed values (last, numRangeElements) and two other ints used for lazy val thread safety. Just to be clear, that's NOT 40 times 10,000,000: that's 40 bytes TOTAL. The size of the range is completely irrelevant, because IT DOESN'T STORE THE INDIVIDUAL ELEMENTS. Just the lower bound, upper bound and step.
Now, because a Range is a Seq[Int], it still has to go through boxing for most uses: an int will be converted into an Integer and then back into an int again, which is sadly wasteful.
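The idea can be sketched in Java with a hypothetical IntRange class (not Scala's actual implementation, just an illustration) that stores only its bounds and step, no matter how many elements it spans:

```java
import java.util.Iterator;

// Minimal sketch of a range: constant-size state, elements produced on demand.
public class IntRange implements Iterable<Integer> {
    private final int start, end, step; // the ONLY state, regardless of element count

    public IntRange(int start, int end, int step) {
        this.start = start; this.end = end; this.step = step;
    }

    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int next = start;
            public boolean hasNext() { return next <= end; }
            // Boxes on the way out, as noted above for Seq[Int].
            public Integer next() { int v = next; next += step; return v; }
        };
    }

    public static void main(String[] args) {
        long sum = 0;
        // No 10,000,001-element array or list is ever allocated here.
        for (int x : new IntRange(0, 10_000_000, 1)) {
            if (x % 1_000_000 == 0) sum += x;
        }
        System.out.println(sum); // 55000000: 0 + 1,000,000 + ... + 10,000,000
    }
}
```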
Cons Size Calculation
So, here's a tentative calculation of Cons. First of all, read this article about some general guidelines on how much memory an object takes. The important points are:
Java uses 8 bytes for normal objects, and 12 for object arrays, for "housekeeping" information (what's the class of this object, etc).
Objects are allocated in 8-byte chunks. If your object is smaller than that, it will be padded out to fill the chunk.
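The padding rule amounts to rounding an object's raw size up to the next multiple of 8. A tiny sketch (the Padding class and padded helper are made up for illustration):

```java
public class Padding {
    // Round rawBytes up to the next multiple of 8, as the allocator does.
    static int padded(int rawBytes) {
        return (rawBytes + 7) & ~7;
    }

    public static void main(String[] args) {
        System.out.println(padded(8 + 4 + 4)); // Cons: header + hd + tl = 16, already aligned
        System.out.println(padded(8 + 4));     // a header plus one 4-byte field: 12 padded to 16
    }
}
```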
I actually thought it was 16 bytes, not 8. Anyway, Cons is also smaller than I thought. Its fields are:
public static final long serialVersionUID; // static, doesn't count
private java.lang.Object scala$collection$immutable$$colon$colon$$hd;
private scala.collection.immutable.List tl;
References are at least 4 bytes (could be more on 64-bit JVMs). So we have:
8 bytes Java header
4 bytes hd
4 bytes tl
Which makes it only 16 bytes long. Pretty good, actually. In the example, hd will point to an Integer object, which I assume is 8 bytes long. As for tl, it points to another cons, which we are already counting.
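As a sketch of that layout, here's a hypothetical Java rendering of a cons cell with exactly the two fields counted above (hd and tl); this is an illustration, not Scala's actual code:

```java
// Hypothetical cons cell: one reference to the element, one to the rest of the list.
public class Cons {
    final Object hd; // reference to the (boxed) element
    final Cons tl;   // reference to the tail; null plays the role of Nil here

    Cons(Object hd, Cons tl) { this.hd = hd; this.tl = tl; }

    public static void main(String[] args) {
        // A three-element list built from cons cells: 1 :: 2 :: 3 :: Nil
        Cons list = new Cons(1, new Cons(2, new Cons(3, null)));
        int length = 0;
        for (Cons c = list; c != null; c = c.tl) length++;
        System.out.println(length); // 3
    }
}
```

Each cell is a separate heap allocation, which is why a 10,000,000-element List scatters its cells all over memory instead of sitting in one contiguous block.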
I'm going to revise the estimates, with actual data where possible.
In the first example you create a linked list with 10 elements by computing the steps of the range.
In the second example you create a linked list with 10 million elements and filter it down to a new linked list with 10 elements.
In the third example you create an array-backed buffer with 10 million elements, which you traverse and print; no new array-backed buffer is created.
Conclusion:
Every piece of code does something different, that's why the performance varies greatly.
This is an educated guess ...
I think it is because in the fast version the Scala compiler is able to translate the key statement into something like this (in Java):
List<Integer> millions = new ArrayList<Integer>();
for (int i = 0; i <= 10000000; i++) {
    if (i % 1000000 == 0) {
        millions.add(i);
    }
}
As you can see, (0 to 10000000) doesn't generate an intermediate list of 10,000,000 Integer objects.
By contrast, in the slow version the Scala compiler is not able to do that optimization, and is generating that list.
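For comparison, here is a hedged Java sketch (my guess at the effective behavior, not decompiled output; the class name SlowVersion is made up) of what the slow version does: materialize all the boxed elements first, then filter:

```java
import java.util.LinkedList;
import java.util.List;

public class SlowVersion {
    public static void main(String[] args) {
        // Step 1: build the full intermediate list -- roughly 10 million boxed
        // Integers plus 10 million list nodes, all heap-allocated.
        List<Integer> all = new LinkedList<>();
        for (int i = 0; i <= 10_000_000; i++) {
            all.add(i);
        }
        // Step 2: traverse it to produce the small filtered result.
        List<Integer> millions = new LinkedList<>();
        for (int x : all) {
            if (x % 1_000_000 == 0) millions.add(x);
        }
        System.out.println(millions.size()); // 11: 0, 1,000,000, ..., 10,000,000
    }
}
```

The work in step 2 is identical either way; the cost difference is entirely the allocation and traversal of the intermediate structure in step 1.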
(The intermediate data structure could possibly be an int[], but the observed JVM size suggests that it is not.)
It's hard to read the Scala source on my iPad, but it looks like Range's constructor isn't actually producing a list, just remembering the start, increment and end. It uses these to produce its values on request, so that iterating over a range is a lot closer to a simple for loop than examining the elements of an array.
As soon as you say range.toList you are forcing Scala to produce a linked list of the 'values' in the range (allocating memory for both the values and the links), and then you are iterating over that. Being a linked list the performance of this is going to be worse than your Java ArrayList example.