CPU Resources and Clock Cycles: System.out.println or Incrementing a Flag - java

To debug our Android code we have put System.out.println(string) calls in, which let us know how many times a function has been called. The other approach would have been to keep a flag (really a counter), increment it after every function call, and then print its final value once at the end with System.out.println(...). (In practice, the function will be called thousands of times in my application.)
My question is: in terms of CPU resources and clock cycles, which is lighter: an increment operation or System.out.println?

Incrementing is going to be much, much more efficient - especially if you've actually got anywhere for that output to go. Think of all the operations required by System.out.println vs incrementing a variable. Of course, whether the impact will actually be significant is a different matter - and if your method is already doing a lot of work, then a System.out.println call may not actually make much difference. But if you just want to know how many times it was called, then keeping a counter makes more sense than looking through the logs anyway, IMO.
I would recommend using AtomicLong or AtomicInteger instead of just having a primitive variable, as that way you get simple thread-safety.
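A minimal sketch of that idea (someFunction is a stand-in for the method being counted):
import java.util.concurrent.atomic.AtomicLong;

public class CallCounter {
    private static final AtomicLong CALLS = new AtomicLong();

    static void someFunction() {
        CALLS.incrementAndGet(); // thread-safe, lock-free increment
        // ... the real work of the method ...
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            someFunction();
        }
        // one println at the end instead of one per call
        System.out.println("someFunction called " + CALLS.get() + " times");
    }
}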

Incrementing will be a lot faster in terms of clock cycles. Assuming the increment compiles down to something close to a hardware increment, it would only take a couple of clock cycles, meaning you can do millions per second.
System.out.println, on the other hand, has to convert characters, write to stdout, and call out to the OS. Each of these steps takes many, many clock cycles.
Coming back to your original question: if you want to know how many times a function gets called, you could run a profiler; there are various desktop and Android solutions available. That way you wouldn't need to pollute your code with counting or printing, and you can keep your production code lean.
Thinking a little further: why do you want to know the exact number of times a function is called? If you're concerned about a defect, consider writing unit tests that prove exactly how many times a function gets called. If you're concerned about performance, look at load-testing techniques in combination with your profiler.
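For example, with a mocking library such as Mockito the call count can be asserted directly. A sketch (Worker and Service are hypothetical names for a collaborator and the class under test):
import static org.mockito.Mockito.*;
import org.junit.jupiter.api.Test;

class ServiceTest {
    @Test
    void runsWorkerExactlyThousandTimes() {
        Worker worker = mock(Worker.class);     // hypothetical collaborator
        Service service = new Service(worker);  // hypothetical class under test

        service.run();

        // Fails unless worker.process() was invoked exactly 1000 times
        verify(worker, times(1000)).process();
    }
}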

Related

Which manner of processing two repetitive operations will be completed faster?

Generally speaking, is there a significant difference in the processing speeds of these two example segments of code, and if so, which should complete faster? Assume that processA(int) and processB(int) are void methods common to both examples.
for (int x = 0; x < 1000; x++) {
    processA(x);
    processB(x);
}
or
for (int x = 0; x < 1000; x++) {
    processA(x);
}
for (int x = 0; x < 1000; x++) {
    processB(x);
}
I'm looking to see if I can speed up one of my programs, and it involves cycling through blocks of data several times and processing them in different ways. Currently, it runs a separate cycle for each processing method, meaning a lot of cycles get run in total, but each cycle does very little work. I was thinking about rewriting my code so that each cycle incorporates every processing method; in other words, far fewer cycles, but each cycle has a heavier workload.
This would be a very intensive rewrite of my program structure, so unless it would give me a significant performance boost, it won't be worth the trouble.
The first case will be slightly faster than the second case, because a for loop in and of itself has an effect on performance. However, the main question you should ask yourself is this: will the effect be of significance to my program? If not, you should opt for clear and readable code.
One thing to remember in such a case is that the JVM (Java Virtual Machine) does a whole lot of optimisation, and in your case the JVM can even get rid of the for-loop and rewrite the code into 1000 successive calls to processA() and processB(). So even if you have two for loops, the JVM can get rid of both, making your program more optimal than even your first case.
To get a basic understanding of method calls, cost, and the JVM, you can read this short article:
https://plumbr.eu/blog/how-expensive-is-a-method-call-in-java
Absolutely, fusing the two loops will be faster. (Some compilers do this automatically as an optimization.) How much faster? That depends on how many iterations the loops run. Unless the number of iterations is very high, you can expect the improvement to be minimal.
The single-loop case will contain fewer instructions, so it will run faster.
But unless processA and processB are very quick functions, such a substantial refactoring would give you a negligible performance gain.
If this is production code, you should also take care, as there may be side effects. You should make alterations in the context of a unit-testing framework that tests the code in question. (In C++, for example, x may be passed by reference and could be modified by the functions! Java has no such hazard, but there may be other reasons why all processA calls have to run before all processB calls, and the program's comments may not make this clear.)
The first piece of code is faster, because it only has one for loop.
If, however, you need to do something AFTER processA() has been executed n times and before processB()'s loop starts, then the second option would be ideal.

Calling the function twice vs. storing the output and using it in Java

Suppose that I have a boolean function isCorrect(Set<Integer>).
The parameter of the function is calculated by another function buildSet().
Which one is better in terms of both time and space efficiency?
Set<Integer> set = buildSet();
if (isCorrect(set))
    doSomethingWith(set);
or
if (isCorrect(buildSet()))
    doSomethingWith(buildSet());
The first approach is better, and I don't think this is a matter of opinion. Don't call the same function twice wastefully when you already have its result. Of course, I'm assuming buildSet() doesn't have any necessary side effects.
Which one is better in terms of both time and space efficiency?
In terms of time, you're building the set once in the first snippet and twice in the second snippet, so presumably the second would take longer. In terms of space, there likely will not be a difference. However, you seem to be instantiating two objects in your second snippet and just one in your first (again, I can't be sure of this because I don't know how buildSet() is implemented). If this is the case and you're retaining both of those objects, then the second snippet will use twice the space as well.
The answer is: it depends. See below:
If the function call is time-consuming, you definitely should store the result.
Storing the result might make the code a little bit less readable, although readability is not an objective metric.
Sometimes you may want to be sure that the first and second usage deal with exactly the same object. When you call the function twice, you may get different results; this situation is usually called a race condition or data race, and in most cases it affects program correctness.
So, to summarize: in most cases it makes sense to store the result. Sometimes (but IMO not too often) it is not really necessary.
While the existing answers give good reasons why storing the value is best, they miss out on what I feel is an important one (in fact, the most important one): in your second example (running the function twice), you introduce a potential race condition.
If buildSet() relies on outside factors (which is highly probable in any non-trivial function, and could become true with later changes), there is a chance that the value changes between the if check and the second call. This could create a subtle and hard-to-find bug, one that would potentially only become visible when you made changes elsewhere, or when certain events happened with specific timings.
This is, by itself, a good reason to avoid such a pattern.
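A contrived sketch of the hazard, assuming a buildSet() that snapshots shared mutable state (the class name and the isCorrect/doSomethingWith bodies are made-up stand-ins):
import java.util.HashSet;
import java.util.Set;

class SetHolder {
    // Shared state that another thread may modify at any time
    private final Set<Integer> source = new HashSet<>();

    Set<Integer> buildSet() {
        synchronized (source) {
            return new HashSet<>(source); // snapshot of the current contents
        }
    }

    void risky() {
        if (isCorrect(buildSet())) {
            // 'source' may change between the check above and this call,
            // so this second snapshot can differ from the one that passed isCorrect
            doSomethingWith(buildSet());
        }
    }

    void safe() {
        Set<Integer> set = buildSet(); // one snapshot, both checked and used
        if (isCorrect(set)) {
            doSomethingWith(set);
        }
    }

    boolean isCorrect(Set<Integer> s) { return !s.isEmpty(); }      // stand-in
    void doSomethingWith(Set<Integer> s) { System.out.println(s); } // stand-in
}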
By going from the second example (two calls) to the first (one call), you will save the time that the second call to buildSet is on the stack. If that call is on the stack 10% of the time, then your speedup will be a factor of 100/90 = 1.11, or 11%. If it is on the stack 50% of the time, the speedup will be a factor of 100/50 = 2, or 100%.
How do you know what fraction of time a function call is on the stack?
It is inclusive wall-clock percent by line.
Not every profiler will tell you this.
If it only tells you "self time", not inclusive time, that will not tell you anything about the time spent in a function call.
If it only tells you by function, not by line, you can't tell if the second call was the problem (as opposed to the first).
If it is only a "CPU-profiler", then if there is any I/O, sleep, or lock wait in the buildSet function or elsewhere in the app, the profiler will act like it doesn't exist.
If it does not tell you percent, but only milliseconds or call counts, then you have to do the math to figure out what percent of total time the call accounts for.
Call graphs don't tell you, "flame graphs" don't tell you, timelines don't tell you, etc. etc.
One that does tell you is Zoom.
Others may, if you can figure out how to tell them what to do.
The method I and many people use is random pausing.

Timing a block of code with C++ and Java

I am trying to compare the accuracy of timing methods with C++ and Java.
With C++ I usually use clock() together with CLOCKS_PER_SEC: I run the block of code I want to time for a certain amount of time and then calculate how long it took, based on how many times the block was executed.
With Java I usually use System.nanoTime().
Which one is more accurate, the one I use for C++ or the one I use for Java? Is there any other way to time in C++ so I don't have to repeat the piece of code to get a proper measurement? Basically, is there a System.nanoTime() method for C++?
I am aware that both use system calls which cause considerable latencies. How does this distort the real value of the timing? Is there any way to prevent this?
Every method has errors. Before you spend a great deal of time on this question, you have to ask yourself "how accurate do I need my answer to be"? Usually the solution is to run a loop / piece of code a number of times, and keep track of the mean / standard deviation of the measurement. This is a good way to get a handle on the repeatability of your measurement. After that, assume that latency is "comparable" between the "start time" and "stop time" calls (regardless of what function you used), and you have a framework to understand the issues.
Bottom line: the clock() function typically gives microsecond accuracy.
See https://stackoverflow.com/a/20497193/1967396 for an example of how to go about this in C (in that instance, using a usec-precision clock). Nanosecond timing is also possible; see, for example, the answer to "clock_gettime() still not monotonic - alternatives?", which uses clock_gettime(CLOCK_MONOTONIC_RAW, &tSpec);
Note that you have to extract seconds and nanoseconds separately from that structure.
Be careful using System.nanoTime() as it is still limited by the resolution that the machine you are running on can give you.
There are also complications when timing Java, as the first few passes through a function will be a lot slower, until the JIT optimizes the code for your system.
Virtually all modern systems use pre-emptive multithreading and multiple cores, so all timings will vary from run to run. (For example, control may get switched away from your thread while it is in the method.)
To get reliable timings you need to
Warm up the system by running the thing you are timing a few hundred times before starting.
Run the code a good number of times and average the results.
The reliability issues are the same for any language, so they apply just as well to C as to Java. C may not need the warm-up loop, but you will still need to take a lot of samples and average them.
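A minimal sketch of that recipe in Java (doWork() and the iteration counts are placeholders):
// 1. Warm up so the JIT has compiled the hot path before measuring
for (int i = 0; i < 10_000; i++) {
    doWork();
}

// 2. Time many runs and average the results
final int runs = 1_000;
long total = 0;
for (int i = 0; i < runs; i++) {
    long start = System.nanoTime();
    doWork();
    total += System.nanoTime() - start;
}
System.out.println("average: " + (total / runs) + " ns per call");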

Automatic parallelization

What is your opinion of a project that will try to take code and split it into threads automatically (maybe at compile time, probably at runtime)?
Take a look at the code below:
for (int i = 0; i < 100; i++)
    sum1 += rand(100);
for (int j = 0; j < 100; j++)
    sum2 += rand(100) / 2;
This kind of code could automatically be split into two different threads that run in parallel.
Do you think it's even possible?
I have a feeling that theoretically it's impossible (it reminds me of the halting problem), but I can't justify this thought.
Do you think it's a useful project? is there anything like it?
This is called automatic parallelization. If you're looking for some program you can use that does this for you, it doesn't exist yet. But it may eventually. This is a hard problem and is an area of active research. If you're still curious...
It's possible to automatically split your example into multiple threads, but not in the way you're thinking. Some current techniques try to run each iteration of a for-loop in its own thread. One thread would get the even indices (i=0, i=2, ...), the other would get the odd indices (i=1, i=3, ...). Once that for-loop is done, the next one could be started. Other techniques might get crazier, executing the i++ increment in one thread and the rand() on a separate thread.
As others have pointed out, there is a true dependency between iterations because rand() has internal state. That doesn't stand in the way of parallelization by itself. The compiler can recognize the memory dependency, and the modified state of rand() can be forwarded from one thread to the other. But it probably does limit you to only a few parallel threads. Without dependencies, you could run this on as many cores as you had available.
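As an illustration, the even/odd split might look like the following when the iterations really are independent (compute(i) is a made-up, side-effect-free stand-in; as noted, rand() does not qualify):
static long sumInParallel() throws InterruptedException {
    long[] partial = new long[2];  // one slot per thread, no shared counter
    Thread even = new Thread(() -> {
        for (int i = 0; i < 100; i += 2) partial[0] += compute(i); // i = 0, 2, 4, ...
    });
    Thread odd = new Thread(() -> {
        for (int i = 1; i < 100; i += 2) partial[1] += compute(i); // i = 1, 3, 5, ...
    });
    even.start();
    odd.start();
    even.join();                   // wait for both halves to finish
    odd.join();
    return partial[0] + partial[1];
}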
If you're truly interested in this topic and don't mind sifting through research papers:
Automatic thread extraction with decoupled software pipelining (2005) by G. Ottoni.
Speculative parallelization using software multi-threaded transactions (2010) by A. Raman.
This is practically not possible.
The problem is that you need to know, in advance, a lot more information than is readily available to the compiler, or even the runtime, in order to parallelize effectively.
While it would be possible to parallelize very simple loops, even then, there's a risk involved. For example, your above code could only be parallelized if rand() is thread-safe, and many random number generation routines are not. (Java's Math.random() is synchronized for you, however.)
Trying to do this type of automatic parallelization is, at least at this point, not practical for any "real" application.
It's certainly possible, but it is an incredibly hard task. This has been the central thrust of compiler research for several decades. The basic issue is that we cannot make a tool that can find the best partition into threads for Java code (this is equivalent to the halting problem).
Instead we need to relax our goal from the best partition to some partition of the code. This is still very hard in general. So then we need to find ways to simplify the problem; one is to forget about general code and start looking at specific types of program. If you have simple control flow (constant-bounded for-loops, limited branching, ...) then you can make much more headway.
Another simplification is reducing the number of parallel units that you are trying to keep busy. If you put both of these simplifications together, you get the state of the art in automatic vectorisation (a specific type of parallelisation that is used to generate MMX/SSE-style code). Getting to that stage has taken decades, but if you look at compilers like Intel's, performance is starting to get pretty good.
If you move from vector instructions inside a single thread to multiple threads within a process, then you have a huge increase in the latency of moving data between the different points in the code. This means that your parallelisation has to be a lot better in order to win against the communication overhead. Currently this is a very hot topic in research, but there are no automatic user-targeted tools available. If you can write one that works, it would be very interesting to many people.
For your specific example, if you assume that rand() is a parallel version, so you can call it independently from different threads, then it's quite easy to see that the code can be split into two. A compiler would just need dependency analysis to see that neither loop uses data from, or affects, the other. So the ordering between them in the user-level code is a false dependency, and they could be split (i.e. by putting each in a separate thread).
But this isn't really how you would want to parallelise the code. It looks as if each loop iteration depends on the previous one, as sum1 += rand(100) is the same as sum1 = sum1 + rand(100), where the sum1 on the right-hand side is the value from the previous iteration. However, the only operation involved is addition, which is associative, so we can rewrite the sum in many different ways:
sum1 = (((rand_0 + rand_1) + rand_2) + rand_3) ....
sum1 = (rand_0 + rand_1) + (rand_2 + rand_3) ...
The advantage of the second form is that each single addition in brackets can be computed in parallel with all of the others. Once you have 50 partial results, they can be combined in a further 25 additions, and so on. A pairwise tree over 100 values performs 99 additions in total, essentially the same work as the original loop, but in only ceil(log2 100) = 7 sequential rounds instead of 100 sequential steps, so apart from the parallel forking/joining and communication overhead it runs roughly 14 times quicker. This tree of additions is called a reduction (or gather) in parallel architectures, and it tends to be the expensive part of a computation.
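In today's Java the streams library performs exactly this kind of associative reduction for you; a sketch, using ThreadLocalRandom so each worker thread has its own generator state (which of course changes the exact sequence of random numbers):
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

long sum1 = IntStream.range(0, 100)
        .parallel()                                                // iterations may run on many threads
        .mapToLong(i -> ThreadLocalRandom.current().nextInt(100))  // per-thread RNG, no shared state
        .sum();                                                    // associative, tree-shaped reduction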
On a very parallel architecture such as a GPU the above description would be the best way to parallelise the code. If you're using threads within a process it would get killed by the overhead.
In summary: it is impossible to do perfectly, it is very hard to do well, there is lots of active research in finding out how much we can do.
Whether it's possible in the general case to know whether a piece of code can be parallelized does not really matter, because even if your algorithm cannot detect all cases that can be parallelized, maybe it can detect some of them.
That does not mean it would be useful. Consider the following:
First of all, to do it at compile time, you have to inspect all code paths you can potentially reach inside the construct you want to parallelize. This may be tricky for anything but simple computations.
Second, you have to somehow decide what is parallelizable and what is not. You cannot trivially break up a loop that modifies shared state into several threads, for example. This is probably a very difficult task, and in many cases you will end up not being sure: two variables might in fact reference the same object.
Even if you could achieve this, it would end up confusing for the user. It would be very difficult to explain why their code was not parallelizable and how it should be changed.
I think that if you want to achieve this in Java, you need to write it more as a library, and let the user decide what to parallelize (library functions together with annotations? just thinking aloud). Functional languages are much more suited for this.
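In Java a small version of that "let the user decide" approach already exists; for the question's two loops, an explicit opt-in might look like this (ThreadLocalRandom stands in for the question's rand(100)):
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

// The user, not the compiler, declares that the two loops are independent
CompletableFuture<Integer> first = CompletableFuture.supplyAsync(() -> {
    int s = 0;
    for (int i = 0; i < 100; i++) s += ThreadLocalRandom.current().nextInt(100);
    return s;
});
CompletableFuture<Integer> second = CompletableFuture.supplyAsync(() -> {
    int s = 0;
    for (int j = 0; j < 100; j++) s += ThreadLocalRandom.current().nextInt(100) / 2;
    return s;
});
int sum1 = first.join();   // blocks until the first loop's result is ready
int sum2 = second.join();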
As a piece of trivia: during a parallel programming course, we had to inspect code and decide whether it was parallelizable or not. I cannot remember the specifics (something about the "at-most-once" property? Someone fill me in?), but the moral of the story is that it was extremely difficult even for what appeared to be trivial cases.
There are some projects that try to simplify parallelization - such as Cilk. It doesn't always work that well, however.
I've learnt that as of JDK 1.8 (Java 8), you can leverage multiple CPU cores when working with streams by using parallelStream().
However, before going to production with parallelStream(), it is always better to compare sequential() with parallel() by benchmarking the performance, and then decide which would be ideal.
The reason: there are scenarios where a parallel stream performs dramatically worse than a sequential one, such as when the operation needs to do auto-(un)boxing. For those scenarios it's advisable to use the Java 8 primitive streams such as IntStream, LongStream, and DoubleStream.
Reference: Modern Java in Action: Manning Publications 2019
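A sketch of the kind of comparison the book describes (the sizes are arbitrary; Stream.iterate both boxes every element and splits poorly across threads, while LongStream.rangeClosed does neither):
import java.util.stream.LongStream;
import java.util.stream.Stream;

// Boxed and hard to split: often slower in parallel than sequential
long boxedSum = Stream.iterate(1L, i -> i + 1)
        .limit(10_000_000)
        .parallel()
        .reduce(0L, Long::sum);

// Primitive and trivially splittable: a fairer test of parallel speedup
long primitiveSum = LongStream.rangeClosed(1, 10_000_000)
        .parallel()
        .sum();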
The programming language is Java, and Java runs on a virtual machine, so shouldn't one be able to execute the code at runtime on different threads owned by the VM? Since all the memory is managed by the VM, it would not cause any corruption. You could see the code as a stack of instructions, estimate their execution times, and then distribute them across an array of threads that each have an execution stack of roughly the same total time. It might be dangerous, though: some graphics code, like OpenGL immediate mode, needs to maintain ordering and mostly should not be threaded at all.

Capturing the executing time between 2 statements Java?

I want to capture the time taken to go from statement A to statement B in a Java class. In between these statements there are many web service calls. I wanted to know if there is some stopwatch-like functionality in Java that I could use to capture the exact time.
This will give you the number of nanoseconds between the two nanoTime() calls.
long start = System.nanoTime();
// Java statements
long diff = System.nanoTime() - start;
For more sophisticated approaches there are several duplicate questions that address Stopwatch classes:
Java performance timing library
Stopwatch class for Java
@Ben S's answer is spot on.
However, it should be noted that the approach of inserting time measurement statements into your code does not scale:
It makes your code look a mess.
It makes your application run slower. Those calls to System.nanoTime() don't come for free!
It introduces the possibility of bugs.
If your real aim is to work out why your application is running slowly so that you can decide what to optimize, then a better solution is to use a Java profiler. This has the advantage that you need to make ZERO changes to your source code. (Of course, profiling doesn't give you the exact times spent in particular sections. Rather, it gives you time proportions ... which is far more useful for deciding where to optimize.)
System.currentTimeMillis will get it in milliseconds and nanoTime in nanoseconds.
If you're trying to compare the performance of different techniques, note that the JVM environment is complex, so simply taking one time is not meaningful. I always write a loop where I execute method 1 a few thousand times, then call System.gc(), then execute method 2 a few thousand times, then call System.gc() again, then loop back and do the whole thing again at least five or six times. This helps to average out the time for garbage collection, just-in-time compiles, and the other magic things happening in the JVM.
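A rough sketch of that protocol (method1/method2 and the counts are placeholders):
for (int round = 0; round < 6; round++) {
    long t0 = System.nanoTime();
    for (int i = 0; i < 10_000; i++) method1();
    long method1Nanos = System.nanoTime() - t0;
    System.gc(); // request a collection between the two phases

    t0 = System.nanoTime();
    for (int i = 0; i < 10_000; i++) method2();
    long method2Nanos = System.nanoTime() - t0;
    System.gc();

    System.out.println("round " + round + ": method1=" + method1Nanos
            + " ns, method2=" + method2Nanos + " ns");
}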
