High Level Java Optimization - java

There are many questions, answers, and opinions about how to do low-level Java optimization with for, while, and do-while loops, and whether it's even necessary.
My question is more about high-level, design-based optimization. Let's assume I have to do the following:
For a given string input, count the occurrence of each letter in the string.
This is not a major problem when the string is a few sentences, but what if instead we want to count the occurrence of each word in a 900,000-word file? Building loops just seems to waste time.
So what is the high-level design pattern that can be applied to this type of problem?
I guess my major point is that I tend to use loops to solve many problems, and I would like to get out of the habit of using loops.
Thanks in advance,
Sam
P.S. If possible, can you produce some pseudocode for solving the 900,000-word-file problem? I tend to understand code better than I can understand English, which I assume is the same for most visitors of this site.

The word count problem is one of the most widely covered problems in the Big Data world; it's kind of the Hello World of frameworks like Hadoop. You can find ample information throughout the web on this problem.
I'll give you some thoughts on it anyway.
First, 900,000 words might still be small enough to build a hash map for, so don't discount the obvious in-memory approach. You said pseudocode is fine, so:
h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
    h.put(w, h.containsKey(w) ? h.get(w) + 1 : 1);
}
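If you'd rather see it as real Java, here is a minimal runnable sketch of the same idea; the file name words.txt is just an assumption for illustration:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCount {
    public static void main(String[] args) throws FileNotFoundException {
        // Assumes the 900,000-word input lives in a file named words.txt (hypothetical name).
        Map<String, Integer> counts = new HashMap<String, Integer>();
        Scanner in = new Scanner(new File("words.txt"));
        while (in.hasNext()) {
            String w = in.next().toLowerCase();
            Integer c = counts.get(w);
            counts.put(w, c == null ? 1 : c + 1);
        }
        in.close();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}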
Now once your dataset is too large to build an in-memory hashmap, you can do your counting like so:
Tokenize into words, writing each word on its own line of a file
Use the Unix sort command to produce the next file
Count as you traverse the sorted file
These three steps go in a Unix pipeline. Let the OS do the work for you here.
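For example, the whole thing can be a single command line (the input file name is just an illustration, and WordGenerator is the tokenizer class shown in the addendum below):
java WordGenerator < bigfile.txt | sort | uniq -c | sort -rn
Here uniq -c does the per-word counting for you, and the final sort -rn lists the most frequent words first.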
Now, as you get even more data, you want to bring in MapReduce frameworks like Hadoop to do the word counting on clusters of machines.
Now, I've heard that when you get into obscenely large datasets, doing things in a distributed environment no longer helps, because the transmission time overwhelms the counting time, and in your case of word counting everything has to "be put back together" anyway, so then you have to use some very sophisticated techniques that I suspect you can find in research papers.
ADDENDUM
The OP asked for an example of tokenizing the input in Java. Here is the easiest way:
import java.util.Scanner;

public class WordGenerator {
    /**
     * Tokenizes standard input into words, writing each word to standard output,
     * one per line. Because it reads from standard input and writes to standard
     * output, it can easily be used in a pipeline combined with sort, uniq, and
     * any other such application.
     */
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    }
}
Now here is an example of using it:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator
This outputs
hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.
You can combine this tokenizer with sort and uniq like so:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq
Yielding
hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo
Now if you only want to keep letters and throw away all punctuation, digits, and other characters, change your Scanner definition line to:
Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));
(This also requires adding import java.util.regex.Pattern; at the top of the file.)
And now
echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq
Yields
hey
moe
nyuk
soitenly
why
woo
There is a blank line in the output; I'll let you figure out how to whack it. :)

The fastest solution to this is O(n), AFAIK: use a loop to iterate over the string, get each character, and update the count in a HashMap accordingly. At the end the HashMap contains all the characters that occurred and a count of each occurrence.
Some pseudo-code (may not compile):
HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++)
{
    char c = str.charAt(i);
    if (map.containsKey(c)) map.put(c, map.get(c) + 1);
    else map.put(c, 1);
}
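On Java 8 or later the same loop can be written a bit more compactly with Map.merge; a sketch of the equivalent counting loop, still assuming str holds the input:
Map<Character, Integer> map = new HashMap<>();
for (int i = 0; i < str.length(); i++) {
    // merge() stores 1 for a new key, or adds 1 to the existing count
    map.merge(str.charAt(i), 1, Integer::sum);
}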

It's hard for you to get much better than using a loop to solve this problem. IMO, the best way to speed up this sort of operation is to split the workload into different units of work and process the units of work with different processors (using threads, for example, if you have a multiprocessor computer).

You shouldn't assume 900,000 is a lot of words. If you have a CPU with 8 threads at 3 GHz, that's 24 billion clock cycles per second. ;)
However, for counting characters an int[] will be much faster. There are only 65,536 possible char values.
StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
    words.append(Long.toString(rand.nextLong(), 36)).append(' ');
String text = words.toString();

long start = System.nanoTime();
int[] charCount = new int[Character.MAX_VALUE + 1]; // +1 so the largest char value is still a valid index
for (int i = 0; i < text.length(); i++)
    charCount[text.charAt(i)]++;
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
prints
Took 111 ms to count 139,715,647 characters
Even with roughly 11 times the number of words, it takes a fraction of a second.
A much longer parallel version is a little faster.
public static void main(String... args) throws InterruptedException, ExecutionException {
    StringBuilder words = new StringBuilder();
    Random rand = new Random();
    for (int i = 0; i < 10 * 1000 * 1000; i++)
        words.append(Long.toString(rand.nextLong(), 36)).append(' ');
    final String text = words.toString();

    long start = System.nanoTime();
    // start a thread pool to generate 4 tasks to count sections of the text.
    final int nThreads = 4;
    ExecutorService es = Executors.newFixedThreadPool(nThreads);
    List<Future<int[]>> results = new ArrayList<Future<int[]>>();
    int blockSize = (text.length() + nThreads - 1) / nThreads;
    for (int i = 0; i < nThreads; i++) {
        final int min = i * blockSize;
        final int max = Math.min(min + blockSize, text.length());
        results.add(es.submit(new Callable<int[]>() {
            @Override
            public int[] call() throws Exception {
                int[] charCount = new int[Character.MAX_VALUE + 1];
                for (int j = min; j < max; j++)
                    charCount[text.charAt(j)]++;
                return charCount;
            }
        }));
    }
    es.shutdown();
    // combine the results.
    int[] charCount = new int[Character.MAX_VALUE + 1];
    for (Future<int[]> resultFuture : results) {
        int[] result = resultFuture.get();
        for (int i = 0, resultLength = result.length; i < resultLength; i++) {
            charCount[i] += result[i];
        }
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
}
prints
Took 45 ms to count 139,715,537 characters
But for a String with less than a million words it's not likely to be worth it.

As a general rule, you should just write things in a straightforward way, and then do performance tuning to make it as fast as possible.
If that means putting in a faster algorithm, do so, but at first, keep it simple.
For a small program like this, it won't be too hard.
The essential skill in performance tuning is not guessing.
Instead, let the program itself tell you what to fix.
This is my method.
For more involved programs like this one, experience will show you how to avoid the over-thinking that often ends up causing much of the very poor performance it is trying to avoid.

You have to use a divide-and-conquer approach and avoid contention for resources. There are different approaches and/or implementations for that; the idea is the same - split the work and parallelize the processing.
On a single machine you can process chunks of the data in separate threads, although having the chunks on the same disk will slow things down considerably. Having more threads also means more context switching; for throughput it is IMHO better to have a smaller number of them and keep them busy.
You can split the processing into stages and use SEDA or something similar, and with really big data you go for map-reduce - just count, at the expense of distributing data across a cluster.
I'll be glad if somebody points to another widely-used API.
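As a concrete single-machine example of "split the work and parallelize the processing", Java 8 parallel streams do the chunking and merging for you. A sketch, where the file name and the tokenizing regex are assumptions, not anything from the question:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = Files.lines(Paths.get("words.txt"))
                .parallel()                                   // let the framework split the work across cores
                .flatMap(line -> Arrays.stream(line.split("\\P{L}+")))
                .filter(w -> !w.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.groupingByConcurrent(w -> w, Collectors.counting()));
        counts.forEach((word, count) -> System.out.println(word + " " + count));
    }
}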

Related

Custom sorting algorithm performance (vs Arrays.sort() and parallelSort())

I implemented a basic sorting algorithm in Java, and compared its performance to those of native methods (Arrays.sort() and Arrays.parallelSort()). The program is as follows.
public static void main(String[] args) {
    // Randomly populate array
    int[] array = new int[999999];
    for (int i = 0; i < 999999; i++)
        array[i] = (int) Math.ceil(Math.random() * 100);

    long start, end;

    start = System.currentTimeMillis();
    Arrays.sort(array);
    end = System.currentTimeMillis();
    System.out.println("======= Arrays.sort: done in " + (end - start) + " ms ========");

    start = System.currentTimeMillis();
    Arrays.parallelSort(array);
    end = System.currentTimeMillis();
    System.out.println("======= Arrays.parallelSort: done in " + (end - start) + " ms ========");

    start = System.currentTimeMillis();
    orderArray(array);
    end = System.currentTimeMillis();
    System.out.println("======= My way: done in " + (end - start) + " ms ========");
}

private static int[] orderArray(int[] arrayToOrder) {
    for (int i = 1; i < arrayToOrder.length; i++) {
        int currentElementIndex = i;
        while (currentElementIndex > 0 && arrayToOrder[currentElementIndex] < arrayToOrder[currentElementIndex - 1]) {
            int temp = arrayToOrder[currentElementIndex];
            arrayToOrder[currentElementIndex] = arrayToOrder[currentElementIndex - 1];
            arrayToOrder[currentElementIndex - 1] = temp;
            currentElementIndex--;
        }
    }
    return arrayToOrder;
}
When I run this program, my custom algorithm consistently outperforms the native methods, by orders of magnitude, on my machine. Here is a representative output:
======= Arrays.sort: done in 67 ms ========
======= Arrays.parallelSort: done in 26 ms ========
======= My way: done in 4 ms ========
This is independent of:
The number of elements in the array (999999 in my example)
The number of times the sort is performed (I tried inside a for loop and iterated a large number of times)
The data type (I tried with an array of double instead of int and saw no difference)
The order in which I call each ordering algorithm (does not affect the overall difference of performance)
Obviously, there's no way my algorithm is actually better than the ones provided with Java. I can only think of two possible explanations:
There is a flaw in the way I measure the performance
My algorithm is too simple and is missing some corner cases
I expect the latter is true, seeing as I used a fairly standard way of measuring performance in Java (System.currentTimeMillis()). However, I have extensively tested my algorithm and can find no flaws as of yet - an int has predefined boundaries (Integer.MIN_VALUE and Integer.MAX_VALUE) and cannot be null, so I can't think of any corner case I've not covered.
There is also the difference between my algorithm's time complexity (O(n^2)) and the native methods' (O(n log n)), which could obviously have an impact. Again, however, I believe my complexity is sufficient...
Could I get an outsider's look on this, so I know how I can improve my algorithm?
Many thanks,
Chris.
You're sorting the array in place, but you didn't re-scramble it between each trial. This means that after the first sort you're sorting the best-case scenario: an already-sorted array. In between each call to an array-sorting method you can re-create the array:
for (int i = 0; i < TEST_SIZE; i++)
array[i] = (int)Math.ceil(Math.random() * 100);
After doing this you will notice your algorithm is about 100 times slower.
That said, this is not the best way to compare the methods in the first place. At a minimum you should be sorting a copy of the same original array for each algorithm. You should also perform multiple iterations of each algorithm and average the results; the result from a single trial is noisy and not reliable as a comparison.
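A minimal sketch of that kind of harness (the class name, seed, and trial counts are just illustrative assumptions): keep a pristine copy of the random data, hand the algorithm a fresh copy each trial, and average over several trials.
import java.util.Arrays;
import java.util.Random;

public class SortBenchmark {
    public static void main(String[] args) {
        final int size = 999_999;
        final int trials = 10;
        int[] original = new int[size];
        Random rnd = new Random(42);
        for (int i = 0; i < size; i++) {
            original[i] = rnd.nextInt(100) + 1;
        }
        long total = 0;
        for (int t = 0; t < trials; t++) {
            int[] copy = Arrays.copyOf(original, original.length); // every trial sorts the same unsorted data
            long start = System.nanoTime();
            Arrays.sort(copy);
            total += System.nanoTime() - start;
        }
        System.out.println("Arrays.sort average: " + (total / trials / 1_000_000) + " ms");
        // repeat the same pattern for Arrays.parallelSort and your own orderArray method
    }
}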

Multithreading Usage

I am iterating through a HashMap with +- 20 million entries. In each iteration I am again iterating through another HashMap with +- 20 million entries.
HashMap<String, BitSet> data_1 = new HashMap<String, BitSet>();
HashMap<String, BitSet> data_2 = new HashMap<String, BitSet>();
I am dividing data_1 into chunks based on the number of threads (threads = cores; I have a four-core processor).
My code is taking more than 20 hours to execute, excluding the time to store the results in a file.
1) If I want to store the results of each thread into a file without overlapping, how can I do that?
2) How can I make the following much faster?
3) How do I create the chunks dynamically, based on the number of cores?
int cores = Runtime.getRuntime().availableProcessors();
int threads = cores;
//Number of threads
int Chunks = data_1.size() / threads;
// I don't trust the chunks created by the line below; that's why I created chunk1, chunk2, chunk3, chunk4 separately and validated them.
Map<Integer, BitSet>[] Chunk= (Map<Integer, BitSet>[]) new HashMap<?,?>[threads];
4) How do I create threads using for loops? Is what I am doing correct?
ClassName thread1 = new ClassName(data2, chunk1);
ClassName thread2 = new ClassName(data2, chunk2);
ClassName thread3 = new ClassName(data2, chunk3);
ClassName thread4 = new ClassName(data2, chunk4);
thread1.start();
thread2.start();
thread3.start();
thread4.start();
thread1.join();
thread2.join();
thread3.join();
thread4.join();
Representation of My Code
public class ClassName {
Integer nSimilarEntities = 30;
public void run() {
for (String kNonRepeater : data_1.keySet()) {
// Extract the feature vector
BitSet vFeaturesNonRepeater = data_1.get(kNonRepeater);
// Calculate the sum of 1s (L2 norm is the sqrt of this)
double nNormNonRepeater = Math.sqrt(vFeaturesNonRepeater.cardinality());
// Loop through the repeater set
double nMinSimilarity = 100;
int nMinSimIndex = 0;
// Maintain the list of top similar repeaters and the similarity values
long dpind = 0;
ArrayList<String> vSimilarKeys = new ArrayList<String>();
ArrayList<Double> vSimilarValues = new ArrayList<Double>();
for (String kRepeater : data_2.keySet()) {
// Status output at regular intervals
dpind++;
if (Math.floorMod(dpind, pct) == 0) {
System.out.println(dpind + " dot products (" + Math.round(dpind / pct) + "%) out of "
+ nNumSimilaritiesToCompute + " completed!");
}
// Calculate the norm of repeater, and the dot product
BitSet vFeaturesRepeater = data_2.get(kRepeater);
double nNormRepeater = Math.sqrt(vFeaturesRepeater.cardinality());
BitSet vTemp = (BitSet) vFeaturesNonRepeater.clone();
vTemp.and(vFeaturesRepeater);
double nCosineDistance = vTemp.cardinality() / (nNormNonRepeater * nNormRepeater);
// queue.add(new MyClass(kRepeater,kNonRepeater,nCosineDistance));
// if(queue.size() > YOUR_LIMIT)
// queue.remove();
// Don't bother if the similarity is 0, obviously
if ((vSimilarKeys.size() < nSimilarEntities) && (nCosineDistance > 0)) {
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
} else { // If there are more, keep only the best
// If this is better than the smallest distance, then remove the smallest
if (nCosineDistance > nMinSimilarity) {
// Remove the lowest similarity value
vSimilarKeys.remove(nMinSimIndex);
vSimilarValues.remove(nMinSimIndex);
// Add this one
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
// Refresh the index of lowest similarity value
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
}
} // End loop for maintaining list of similar entries
}// End iteration through repeaters
for (int i = 0; i < vSimilarValues.size(); i++) {
System.out.println(Thread.currentThread().getName() + kNonRepeater + "|" + vSimilarKeys.get(i) + "|" + vSimilarValues.get(i));
}
}
}
}
Finally, if not multithreading, are there any other approaches in Java to reduce the running time?
The computer works similarly to what you would have to do by hand (it processes more digits/bits at a time, but the problem is the same).
If you do addition, the time is proportional to the size of the number.
If you do multiplication or division, it's proportional to the square of the size of the number.
For the computer, the size is based on multiples of 32 or 64 significant bits, depending on the implementation.
I'd say this task is suitable for parallel streams. Don't hesitate to take a look at this concept if you have time. Parallel streams seamlessly use multithreading at full speed.
The top-level processing will look like this:
data_1.entrySet()
      .parallelStream()
      .flatMap(nonRepeaterEntry -> processOne(nonRepeaterEntry.getKey(), nonRepeaterEntry.getValue(), data2))
      .forEach(System.out::println);
You should provide a processOne function with a prototype like this:
Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBitSet, Map<String, BitSet> data2);
It will return the prepared lines that you currently print into a file.
To build the stream inside, you can prepare a List first and then turn it into a stream in the return statement:
return list.stream();
Even though the inner loop could also be processed with streams, parallel streaming inside is discouraged - you already have enough parallelism.
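To make that concrete, here is one possible skeleton of processOne; the variable names are assumptions, the top-30 filtering from the original code is omitted for brevity, and java.util / java.util.stream imports are assumed:
static Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBits, Map<String, BitSet> data2) {
    List<String> lines = new ArrayList<>();
    double normNonRepeater = Math.sqrt(nonRepeaterBits.cardinality());
    for (Map.Entry<String, BitSet> entry : data2.entrySet()) {
        // intersect the two feature vectors and compute the cosine similarity
        BitSet intersection = (BitSet) nonRepeaterBits.clone();
        intersection.and(entry.getValue());
        double cosine = intersection.cardinality()
                / (normNonRepeater * Math.sqrt(entry.getValue().cardinality()));
        if (cosine > 0) {
            lines.add(nonRepeaterKey + "|" + entry.getKey() + "|" + cosine);
        }
    }
    return lines.stream(); // the caller flatMaps and prints these lines
}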
For your questions:
1) If I want to store the results of each thread into a file without overlapping, how can I do that?
Any logging framework (logback, log4j) can deal with it. Parallel streams can deal with it. You can also store the prepared lines in some queue/array and print them from a separate thread. It takes a bit of care, though; ready-made solutions are easier and effectively do the same thing.
2) How can I make the following much faster?
Optimize and parallelize. In a normal situation you get somewhere between number_of_threads/1.5 and number_of_threads times faster processing, assuming hyperthreading is in play, but it depends on how much of your work is not-so-parallel and on the underlying implementations.
3) How do I create the chunks dynamically, based on the number of cores?
You don't have to. Make a list of tasks (one task per data_1 entry) and feed an executor service with them - that's already a big enough task size. You can use a FixedThreadPool with the number of threads as a parameter, and it will distribute the tasks evenly.
Note: you should create a task class, get a Future for each task from threadpool.submit, and at the end run a loop calling .get() on each Future. That throttles the main thread down to the executor's processing speed, implicitly giving fork-join-like behaviour (a sketch follows point 4 below).
4) Direct thread creation is an outdated technique. It's recommended to use an executor service of some sort, parallel streams, etc. If you must do it with loops: create a list of chunks, in one loop create a thread per chunk and add it to a list of threads, and in another loop join each thread in the list.
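A rough sketch of points 3 and 4 together, using a fixed thread pool and Futures instead of hand-made chunks (java.util and java.util.concurrent imports assumed; linesFor is a hypothetical helper that does the inner-loop work for one data_1 entry and returns the lines to print):
void runAll(Map<String, BitSet> data_1, Map<String, BitSet> data_2) throws Exception {
    int cores = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(cores);
    List<Future<List<String>>> futures = new ArrayList<>();
    for (Map.Entry<String, BitSet> nonRepeater : data_1.entrySet()) {
        // one small task per entry; the pool spreads them across the cores
        Callable<List<String>> task = () -> linesFor(nonRepeater.getKey(), nonRepeater.getValue(), data_2);
        futures.add(pool.submit(task));
    }
    for (Future<List<String>> f : futures) {
        for (String line : f.get()) { // get() blocks until that task has finished
            System.out.println(line);
        }
    }
    pool.shutdown();
}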
Ad hoc optimizations:
1) Make a Repeater class that stores the key, the BitSet, and its cardinality. Preprocess your hash maps, turning them into Repeater instances and calculating each cardinality once (i.e. not on every inner-loop run). It will save you 20mil*(20mil-1) calls to .cardinality(). You still need to call it for the combined (ANDed) BitSet.
2) Replace similarKeys/similarValues with a limited-size PriorityQueue of combined entries; it works faster for 30 elements (see the sketch after this list).
Take a look at this question for info about PriorityQueue:
Java PriorityQueue with fixed size
3) You can skip processing a nonRepeater if its cardinality is already 0 - ANDing the BitSets can never increase the resulting cardinality, and you'll filter out all 0-distance values anyway.
4) You can likewise skip (remove from the temporary list you create in optimization 1) every Repeater with zero cardinality. As in point 3, it will never produce anything fruitful.
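A minimal sketch of optimization 2; the Entry class and the constructor-supplied limit (30, i.e. nSimilarEntities in the original code) are assumptions:
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical helper: keeps only the 'limit' best (repeaterKey, similarity) pairs seen so far.
class TopSimilarities {
    static final class Entry {
        final String key;
        final double similarity;
        Entry(String key, double similarity) { this.key = key; this.similarity = similarity; }
    }

    private final int limit; // e.g. nSimilarEntities = 30 in the original code
    // min-heap: the entry with the lowest similarity sits at the head, ready to be evicted
    private final PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.similarity));

    TopSimilarities(int limit) { this.limit = limit; }

    void offer(String key, double similarity) {
        heap.add(new Entry(key, similarity));
        if (heap.size() > limit) {
            heap.poll(); // O(log limit) eviction instead of the linear rescan in the original code
        }
    }
}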

Can Java Streams transform a list of points into a list of their coordinates?

I have a stream of Point3Ds in a JavaFX 8 program. I would like, for the sake of creating a Mesh from them, to be able to produce a list of their (x, y, z) coordinates instead.
This is a simple enough task through traditional Java looping. (Almost trivial, actually.) However, in the future, I'll likely be dealing with tens of thousands of points; and I would very much like to be able to use the Java Stream API and accomplish this with a parallel stream.
I suppose what I'm looking for is the rough equivalent of this pseudocode:
List<Double> coordinates = stream.parallel().map(s -> (s.getX(), s.getY(), s.getZ())).collect(Collectors.asList());
As of yet, I've found no such feature though. Could someone kindly give me a push in the right direction?
You can use flatMap:
List<Double> coordinates =
    stream.parallel()
          .flatMap(s -> Stream.of(s.getX(), s.getY(), s.getZ()))
          .collect(Collectors.toList());
Why? Even with "tens of thousands of points", the code will complete in very little time, and you won't really gain anything "with a parallel stream".
This sounds like a perfect example of premature optimization, where you potentially complicate the code for something that isn't (yet) a problem, and is unlikely to ever be one, in this case at least.
To prove my point, I created the test code below.
To minimize the effect of GC runs, I ran this code with -Xms10g -Xmx10g, and added the explicit gc() calls, so test runs were running with a "clean slate".
As always, performance testing is subject to JIT optimizations and other factors, so a warm-up loop was provided.
public static void main(String[] args) {
    Random rnd = new Random();
    List<Point3D> input = new ArrayList<>();
    for (int i = 0; i < 10_000; i++)
        input.add(new Point3D(rnd.nextDouble(), rnd.nextDouble(), rnd.nextDouble()));

    for (int i = 0; i < 100; i++) {
        test1(input);
        test2(input);
    }

    for (int i = 0; i < 10; i++) {
        long start1 = System.nanoTime();
        test1(input);
        long end1 = System.nanoTime();
        System.gc();
        long start2 = System.nanoTime();
        test2(input);
        long end2 = System.nanoTime();
        System.gc();
        System.out.printf("%.6f %.6f%n", (end1 - start1) / 1_000_000d, (end2 - start2) / 1_000_000d);
    }
}

private static List<Double> test1(List<Point3D> input) {
    List<Double> list = new ArrayList<>();
    for (Point3D point : input) {
        list.add(point.getX());
        list.add(point.getY());
        list.add(point.getZ());
    }
    return list;
}

private static List<Double> test2(List<Point3D> input) {
    return input.stream().parallel()
            .flatMap(s -> Stream.of(s.getX(), s.getY(), s.getZ()))
            .collect(Collectors.toList());
}
RESULT
0.355267 0.392904
0.205576 0.260035
0.193601 0.232378
0.194740 0.290544
0.193601 0.238365
0.243497 0.276286
0.200728 0.243212
0.197022 0.240646
0.192175 0.239790
0.198162 0.279708
No major difference, although parallel stream seems slightly slower.
Also notice that it completes in less than 0.3 ms, for 10,000 points.
It's nothing!
Let's try to increase the count from 10,000 to 10,000,000 (skipping warmup):
433.716847 972.100743
260.662700 693.263850
250.699271 736.744653
250.486281 813.615375
249.722716 714.296997
254.704145 796.566859
254.713840 829.755767
253.368331 959.365322
255.016928 973.306254
256.072177 1047.562090
Now there's a definite degradation of the parallel stream. It is 3 times slower. This is likely caused by extra GC runs.
CONCLUSION: Premature optimization is bad!!!!
In your case, you actually made it worse.

Can this code be more efficient?

This program should produce this output:
N 10*N 100*N 1000*N
1 10 100 1000
2 20 200 2000
3 30 300 3000
4 40 400 4000
5 50 500 5000
So here's my code:
public class ex_4_21 {
    public static void main( String Args[] ){
        int process = 1;
        int process2 = 1;
        int process22 = 1;
        int process3 = 1;
        int process33 = 2;
        System.out.println("N 10*N 100*N 1000*N");
        while(process<=5){
            while(process2<=3){
                System.out.printf("%d ",process2);
                while(process22<=3){
                    process2 = process2 * 10;
                    System.out.printf("%d ",process2);
                    process22++;
                }
                process2++;
            }
            process++;
        }
    }
}
Can my code be more efficient? I am currently learning while loops, and so far this is what I've got. Can anyone make this more efficient, or give me ideas on how to make my code more efficient?
This is not homework; I am self-studying Java.
You can use a single variable n to do this.
while(n is less than the maximum value that you wish n to be)
print n and a tab
print n * 10 and a tab
print n * 100 and a tab
print n * 1000 and a new line
n++
if the power of 10 is variable then you can try this:
while(n is less than the maximum value that you wish n to be)
while(i is less than the max power of ten)
print n * i * 10 and a tab
i++
print a newline
n++
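A Java sketch of that second pseudocode, keeping to while loops since that's what you're practicing (the row/column limits and the tab separator are just example choices):
public class MultiplicationTable {
    public static void main(String[] args) {
        int maxN = 5;     // rows
        int maxPower = 3; // columns: 10*N, 100*N, 1000*N
        int n = 1;
        while (n <= maxN) {
            System.out.print(n);
            int power = 1;
            int factor = 10;
            while (power <= maxPower) {
                System.out.print("\t" + n * factor);
                factor *= 10;
                power++;
            }
            System.out.println();
            n++;
        }
    }
}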
If you must use a while loop
public class ex_4_21 {
    public static void main( String Args[] ){
        int process = 1;
        System.out.println("N 10*N 100*N 1000*N");
        while (process <= 5) {
            System.out.println(process + " " + 10*process + " " + 100*process + " " + 1000*process);
            process++;
        }
    }
}
You have one too many while loops (your "process2" while loop is unnecessary). You also appear to have some bugs related to the fact that the variables you are looping on in the inner loops are not re-initialized with each iteration.
I would also recommend against while loops for this; Your example fits a for loop much better; I understand you are trying to learn the looping mechanism, but part of learning should also be in deciding when to use which construct. This really isn't a performance recommendation, more an approach recommendation.
I don't have any further performance improvement suggestions, for what you are trying to do; You could obviously remove loops (dropping down to a single or even no loops), but two loops makes sense for what you are doing (allows you to easily add another row or column to the output with minimal changes).
You can try loop unrolling, similar to @Vincent Ramdhanie's answer.
However, loop unrolling and threading won't produce a significant performance improvement for such a small sample. The overhead involved in creating and launching threads (processes) takes more time than a simple while loop. The overhead in I/O will take more time than the unrolled version saves. A complex program is harder to debug and maintain than a simple one.
This kind of thinking is called micro-optimization. Save the optimizations for larger programs, and only apply them when the requirements cannot be met or the customer(s) demand it.

Code inside thread slower than outside thread..?

I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
    public void run()
    {
        //doSomething
    }
};
Then I submit the Runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when working in a Thread, but I don't know what it could be or what, in that aspect is the difference with //doSomethingElse.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out-of-context, but I will post it here anyway. For those who know the Mean Shift segmentation algorithm, this a part of the code where the mean shift vector is being calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.
All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;

for (int i = 0; i < 10; i++) {
    final int lap = i;
    Runnable r = new Runnable() {
        public void run() {
            long start = System.currentTimeMillis();
            //doSomething
            long duration = System.currentTimeMillis() - start;
            System.out.printf("Lap %d: %d ms%n", lap, duration);
        }
    };
    executor.execute(r);
}
Do you notice any difference in the results?
I personally do not see any reason for this. Any program has at least one thread. All threads are equal. All threads are created by default with medium priority (5). So, the code should show the same performance in both the main application thread and any other thread that you open.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.
When you create a new thread you always have an overhead. If you have a small piece of code, you may experience performance loss.
Once you have more code (bigger tasks) you may get a performance improvement from your parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: deciding how small a task can be while still being worth parallelizing is a well-known topic in parallel computing :)
You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking when measuring performance it's easy to get mislead when measuring small pieces of work. I would be looking to get a run of at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one difference between the "No Thread" and "Threaded" cases is actually that you have gone from having one thread (as has been pointed out, you always have a thread) to two threads, so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.
