I have an ArrayList of around 3483 objects ready to be merged into the database.
When I use a lambda expression it takes 85122 ms.
But for (Obj o : list) takes only 25 ms. Why does Java 8 take 3404 times longer?
List<CIBSubjectData> list1 = .....
list1.forEach(data -> merge(data));
for (CIBSubjectData data : list1) {
merge(data);
}
I believe you are not using a proper microbenchmark setup. You are comparing the warmup of the bytecode generation framework (ASM, which is used to generate the lambda bytecode at runtime) plus the lambda execution time against the execution time of the loop alone.
Check the answer to performance-difference-between-java-8-lambdas-and-anonymous-inner-classes and the linked document. The linked document gives a deep insight into the processing under the hood.
edit: To demonstrate the above, here is a small snippet.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Stream;

public class Warmup {
static int dummy;
static void merge(String s) {
dummy += s.length();
dummy++;
dummy -= s.length();
}
public static void main(String[] args) throws IOException {
List<String> list1 = new ArrayList<>();
Random rand = new Random(1);
for (int i = 0; i < 100_000; i++) {
list1.add(Long.toString(rand.nextLong()));
}
// this will bootstrap the lambda bytecode generation
// Stream.of("foo".toCharArray()).forEach(System.out::println);
long start = System.nanoTime();
list1.forEach(data -> merge(data));
long end = System.nanoTime();
System.out.printf("duration: %d%n", end - start);
System.out.println(dummy);
}
}
If you run the code as it is posted the printed duration on my machine is
duration: 71694425
If you uncomment the Stream.of(...) line (which is only there to exercise the bytecode generation framework for the first time), the printed duration is
duration: 7516086
That is only around 10% of the initial run.
note: Just to be explicit, don't use benchmarks like the one above. Have a look at JMH for this kind of requirement.
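For illustration, a minimal JMH version of this comparison could look like the following sketch (the class and method names are mine, and it assumes the usual JMH setup with the org.openjdk.jmh.annotations package on the classpath). JMH runs warmup iterations first, so the one-time lambda linkage cost is excluded from the measured score:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class ForEachBenchmark {
    List<String> list1;

    @Setup
    public void setup() {
        list1 = new ArrayList<>();
        Random rand = new Random(1);
        for (int i = 0; i < 100_000; i++) {
            list1.add(Long.toString(rand.nextLong()));
        }
    }

    @Benchmark
    public int lambdaForEach() {
        // returning a value keeps the JIT from eliminating the work
        int[] sum = {0};
        list1.forEach(s -> sum[0] += s.length());
        return sum[0];
    }

    @Benchmark
    public int enhancedFor() {
        int sum = 0;
        for (String s : list1) {
            sum += s.length();
        }
        return sum;
    }
}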
I have been looking into the volatile keyword and how it can be used to influence the way memory is stored in and accessed from the CPU cache. I am using a simple test program to explore how placing two variables, each accessed by a concurrently executing thread, on different cache lines improves read/write speed. Obviously this method would never be used in the real world, but I found the following program very useful in demonstrating and understanding, albeit very crudely, how data is stored in the CPU cache:
public class Volatile {
private volatile int a = 0;
private long dummy1 = 0;
private long dummy2 = 0;
private long dummy3 = 0;
private long dummy4 = 0;
private volatile int b = 0;
private static long lastA;
private static long lastB;
public static void main(String[] args) {
final Volatile instance = new Volatile();
new Thread(new Runnable(){
@Override
public void run() {
lastA = System.nanoTime();
while(true){
instance.a++;
if(instance.a % 100_000_000 == 0){
System.out.println("A: " + (System.nanoTime() - lastA) / 1000000 + "ms");
lastA = System.nanoTime();
instance.a = 0;
}
}
}
}).start();
new Thread(new Runnable(){
@Override
public void run() {
lastB = System.nanoTime();
while(true){
instance.b++;
if(instance.b % 100_000_000 == 0){
System.out.println("B: " + (System.nanoTime() - lastB) / 1000000 + "ms");
lastB = System.nanoTime();
instance.b = 0;
}
}
}
}).start();
}
}
Here the code pads dummy variables between a and b so that they are stored on separate cache lines and the two threads accessing them do not contend for the same line (false sharing). The results produced by this program are as expected: the time taken to increment each variable to 100_000_000 is approximately 600-700 ms on my CPU. Removing the dummy variables increases this time to approximately 3000-4000 ms.
This is where I encounter some behavior I do not understand.
I replicated the code exactly in a separate class, except that I replaced the anonymous inner classes passed to the thread constructors with lambda expressions, i.e.
final VolatileLambda instance = new VolatileLambda();
new Thread(() -> {
lastA = System.nanoTime();
while(true){
and
new Thread(() -> {
lastB = System.nanoTime();
while(true){
When I ran this second program with lambdas, the results differed from the first program: the padding variables were no longer sufficient to put a and b on separate cache lines, causing the threads to clash and again producing an output of 3000-4000 ms. This was solved by declaring an extra single dummy byte variable after the dummy longs:
private long dummy4 = 0;
private byte dummy5 = 0;
private volatile int b = 0;
The output after declaring this extra byte was then, once again, approximately 600-700 ms.
I have replicated this comparison numerous times on different systems and, strangely, this produces no consistent outcome. Sometimes, using lambdas over anonymous inner classes has no effect on the output, sometimes it does. Even attempting the same comparison on the same system at different times did not always produce the same results.
I'm at a loss trying to explain this behavior, and would greatly appreciate any help. Feel free to ask for clarification on anything, as I probably did not explain this very well.
Thanks!
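As an aside on the padding technique itself: hand-written dummy fields are fragile because the JVM is free to reorder fields, which would fit the inconsistent results described above. A minimal sketch of the JVM-assisted alternative, assuming JDK 8's sun.misc.Contended annotation (for application classes it only takes effect when the JVM is started with -XX:-RestrictContended):
public class VolatileContended {
    // the JVM pads each annotated field onto its own cache line,
    // so the layout no longer depends on hand-counted dummy fields
    @sun.misc.Contended
    private volatile int a = 0;

    @sun.misc.Contended
    private volatile int b = 0;
}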
I have a function, say foo(). My requirement is to run the function for a specific time and force it to return a value within that time. For example, if I run this function for 100 ms, then no matter how many numbers have been added to the list listofnumbers within those 100 ms, it should return those values. I have seen Timer as one solution, but the Timer/TimerTask API schedules a task once or every few seconds. What I want is to run the function once and return whatever value it has within the given time.
List<Integer> foo() {
    List<Integer> listofnumbers = new ArrayList<>();
    for (int i = 0; i < somenumber; i++) {
        listofnumbers.add(i);
    }
    return listofnumbers;
}
The easiest way would be to just keep track of elapsed time.
List<Integer> generateList(int maxRuntime, int maxNum) {
long startTime = System.currentTimeMillis();
List<Integer> numbers = new ArrayList<Integer>();
for(int i = 0; i < maxNum; i++) {
if(System.currentTimeMillis() - startTime > maxRuntime) {
break;
}
numbers.add(i);
}
return numbers;
}
I called this:
System.out.println(generateList(100, Integer.MAX_VALUE));
The output reached around 2041737, i.e. roughly two million additions fit in 100 ms.
Use an AtomicReference and a ScheduledExecutorService.
Create an AtomicReference<List<Something>>; make it shared by the readers and writers.
When a writer needs to write this list again, do:
final List<Something> list = new ArrayList<>(ref.get());
// modify list
ref.set(list);
and schedule that writer using the built-in ScheduledExecutorService capabilities.
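A sketch of how the two pieces could fit together (the class name SnapshotWriter, the 100 ms period, and the way the list is modified are mine, purely for illustration):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotWriter {
    // readers call ref.get() at any time and always see the last complete list
    private final AtomicReference<List<Integer>> ref =
            new AtomicReference<>(new ArrayList<>());

    public void start() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            List<Integer> list = new ArrayList<>(ref.get());
            list.add(list.size()); // modify list
            ref.set(list);         // publish the new snapshot
        }, 0, 100, TimeUnit.MILLISECONDS);
    }
}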
In Java:
r1=complexCalc1();
r2=complexCalc2();
r3=complexCalc3();
r4=complexCalc4();
r5=complexCalc5();
return r1+r2+r3+r4+r5;
Assume running times like
complexCalc1() -> 5 mins
complexCalc2() -> 3 mins
complexCalc3() -> 2 mins
complexCalc4() -> 4 mins
complexCalc5() -> 9 mins
If this program ran sequentially, it would take 23 minutes to compute r1+r2+r3+r4+r5. If the complexCalc() functions ran in parallel, i.e. each in a separate thread, the whole r1+r2+r3+r4+r5 computation would take 9 minutes.
My question is how to achieve this. I have tried several approaches but still can't figure out anything concrete.
Thanks in advance.
A rough draft of the solution, using only standard Java API, looks like this:
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Main {
    private static Callable<Integer> createCalculationSimulator(final int result, final int minutesToWait) {
        return new Callable<Integer>() {
            @Override
            public Integer call() throws Exception {
                Thread.sleep(minutesToWait * 60 * 1000L);
                return result;
            }
        };
    }

    public static void main(String[] args) throws Exception {
        final ExecutorService executorService = Executors.newFixedThreadPool(5);
        final long startTime = System.currentTimeMillis();
        final List<Future<Integer>> results = executorService.invokeAll(
            Arrays.asList(
                createCalculationSimulator(1, 5),
                createCalculationSimulator(2, 3),
                createCalculationSimulator(3, 2),
                createCalculationSimulator(4, 4),
                createCalculationSimulator(5, 9)));
        int resultSum = 0;
        for (final Future<Integer> result : results) {
            resultSum += result.get();
        }
        final long endTime = System.currentTimeMillis();
        System.out.println("The end result is " + resultSum + ". Time needed = " + (endTime - startTime) / 1000 + " seconds.");
        executorService.shutdown(); // let the JVM exit once the pool threads finish
    }
}
If you can divide the work into logically independent tasks (which I believe you can, as you already indicated), then it is fairly easy with Java 5+.
Implement each task in its own Callable.
Submit all of them to the executor: ExecutorService.invokeAll(...).
The above step returns a List of Futures; store it and make sure all of them complete (look at the API).
Note
Initialize the thread pool size to the number of cores (of course, you tune it after you profile).
If you can take an external dependency, I suggest the Guava library, which simplifies working with executors.
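For completeness, here is a sketch of the Guava variant. The GuavaSum class and its wiring are illustrative; ListeningExecutorService, MoreExecutors and Futures.allAsList are real Guava API from com.google.common.util.concurrent:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;

public class GuavaSum {
    public static int sum(List<Callable<Integer>> calculations) throws Exception {
        ListeningExecutorService pool = MoreExecutors.listeningDecorator(
                Executors.newFixedThreadPool(
                        Runtime.getRuntime().availableProcessors()));
        List<ListenableFuture<Integer>> futures = new ArrayList<>();
        for (Callable<Integer> c : calculations) {
            futures.add(pool.submit(c));
        }
        int sum = 0;
        // allAsList yields a future that completes when every task is done
        for (int r : Futures.allAsList(futures).get()) {
            sum += r;
        }
        pool.shutdown();
        return sum;
    }
}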
In the following code snippet, Foo1 is a class that increments a counter every time the method bar() is called. Foo2 does the same thing but with one additional level of indirection.
I would expect Foo1 to be faster than Foo2, however in practice, Foo2 is consistently 40% faster than Foo1. How is the JVM optimizing the code such that Foo2 runs faster than Foo1?
Some Details
The test was executed with java -server CompositionTest.
Running the test with java -client CompositionTest produces the expected result: Foo2 is slower than Foo1.
Switching the order of the loops does not make a difference.
The results were verified with Java 6 on both Sun's and OpenJDK's JVMs.
The Code
public class CompositionTest {
private static interface DoesBar {
public void bar();
public int count();
public void count(int c);
}
private static final class Foo1 implements DoesBar {
private int count = 0;
public final void bar() { ++count; }
public int count() { return count; }
public void count(int c) { count = c; }
}
private static final class Foo2 implements DoesBar {
private DoesBar bar;
public Foo2(DoesBar bar) { this.bar = bar; }
public final void bar() { bar.bar(); }
public int count() { return bar.count(); }
public void count(int c) { bar.count(c); }
}
public static void main(String[] args) {
long time = 0;
DoesBar bar = null;
int reps = 100000000;
for (int loop = 0; loop < 10; loop++) {
bar = new Foo1();
bar.count(0);
int i = reps;
time = System.nanoTime();
while (i-- > 0) bar.bar();
time = System.nanoTime() - time;
if (reps != bar.count())
throw new Error("reps != bar.count()");
}
System.out.println("Foo1 time: " + time);
for (int loop = 0; loop < 10; loop++) {
bar = new Foo2(new Foo1());
bar.count(0);
int i = reps;
time = System.nanoTime();
while (i-- > 0) bar.bar();
time = System.nanoTime() - time;
if (reps != bar.count())
throw new Error("reps != bar.count()");
}
System.out.println("Foo2 time: " + time);
}
}
Your microbenchmark is meaningless. On my computer the code runs in about 8 ms per loop... To get any meaningful numbers, a benchmark should probably run for at least a second.
When both are run for around a second (hint: you need more than Integer.MAX_VALUE repetitions), I find that the run times of both are identical.
The likely explanation for this is that the JIT compiler has noticed that your indirection is meaningless and optimised it out (or at least inlined the method calls) such that the code executed in both loops is identical.
It can do this because it knows bar in Foo2 is effectively final, and it also knows that the argument to the Foo2 constructor is always going to be a Foo1 (at least in our little test). That way it knows the exact code path when Foo2.bar is called. It also knows that the loop is going to run a lot of times (in fact it knows exactly how many times), so inlining looks like a good idea.
I have no idea if that is precisely what it does, but these are all logical observations the JIT could be making about the code. Perhaps in the future some JIT compilers might even optimise away the entire while loop and simply set count to reps, but that seems somewhat unlikely.
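If you want to check the inlining yourself, HotSpot can print its inlining decisions via diagnostic flags (these are standard HotSpot options; the output format varies between JVM versions):
java -server -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining CompositionTest
Each call site's inlining decision is listed, so you can see whether Foo2.bar() collapses into the same code as Foo1.bar().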
Trying to predict performance in modern languages is not very productive.
The JVM is constantly modified to increase the performance of common, readable structures, which, in contrast, makes uncommon, awkward code slower.
Just write your code as clearly as you can. Then, if you really identify a point where the code is actually too slow to meet written specifications, you may have to hand-tweak some areas. But this will probably involve large, simple ideas: object caches, tuning JVM options, and eliminating truly stupid/wrong code. Wrong data structures in particular can be HUGE; I once changed an ArrayList to a LinkedList and reduced an operation from 10 minutes to 5 seconds, and multi-threading a ping sweep that discovered a class-B network took an operation from 8+ hours to minutes.
Out of curiosity, I measured the performance difference between initializing a static field in a static block and in a static method. First, I implemented the two approaches in two separate Java classes, like so:
First:
class Dummy {
static java.util.List<Integer> lista = new java.util.ArrayList<Integer>();
static {
for(int i=0; i < 1000000; ++i) {
lista.add(new Integer(i));
}
}
}
public class First {
public static void main(String[] args) {
long st = System.currentTimeMillis();
Dummy d = new Dummy();
long end = System.currentTimeMillis() - st;
System.out.println(end);
}
}
Second:
class Muddy {
static java.util.List<Integer> lista = new java.util.ArrayList<Integer>();
public static void initList() {
for(int i=0; i < 1000000; ++i) {
lista.add(new Integer(i));
}
}
}
public class Second {
public static void main(String[] args) {
long st = System.currentTimeMillis();
Muddy.initList();
Muddy m = new Muddy();
long end = System.currentTimeMillis() - st;
System.out.println(end);
}
}
Then I executed a little batch script to measure each program 100 times and write the values to a file: batchFile.bat First Second dum.res.txt
After that, I wrote this piece of code to calculate mean value and standard deviation of Dummy's and Muddy's measured values.
This is the result that I've got:
First size: 100 Second size: 100
First Sum: 132 Std. deviation: 13
Second Sum: 112 Std. deviation: 9
And it is similar on my other machines...every time I test it.
Now I'm wondering: why is it so? I checked the bytecode, and Second.class has one more instruction (the call to the static initList()) between the calls to System.currentTimeMillis().
They both do the same thing, so why is First slower? I can't really reason it out just by looking at the bytecode; this was my first time touching javap, and I don't understand bytecode yet.
I think the reason the static block version is slower than the static method version could be the different JIT optimization each one gets.
See this article for more information: Java Secret: Are static blocks interpreted?
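One way to check this yourself (a hedged suggestion using the standard HotSpot flag -XX:+PrintCompilation): run both programs with compilation logging and see whether Dummy.<clinit> ever shows up as JIT-compiled, e.g.
java -XX:+PrintCompilation First
java -XX:+PrintCompilation Second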
Here's my guess as to the reason for this:
The initialization you are doing creates enough objects to cause one or more garbage collections.
When the initialization is called from the static block, it is done during class initialization rather than during simple method execution. During class initialization the garbage collector may have a little more work to do (because the execution stack is longer, for example) than during simple method execution, even though the contents of the heap are almost the same.
To test this, you could try adding -Xms200m or something to your java commands; this should eliminate the need to garbage collect during the initialization you are doing.
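To see whether collections actually happen during the initialization (standard HotSpot flags; the heap sizes are just examples):
java -verbose:gc First
java -verbose:gc -Xms200m -Xmx200m First
-verbose:gc prints a line for each collection, so if the larger initial heap makes those lines and the timing gap disappear, GC was the cause.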