I currently have a single AsyncTask that compares images pairwise, bubble-sort style, using OpenCV. Say I have to compare 400 images to each other. That means 400*399/2 = 79,800 comparisons. Let's assume one comparison takes 1 second; that's 79,800 sec, which is around 22 hours, which is ridiculously long. So I developed an algorithm of this type:
It divides the 400 images into 5 groups, so there are 80 images in each group.
In the first part of the algorithm, images are compared within their own group.
So image1 will compare itself with images 2-80, which means 79 comparisons; image2 will have 78 comparisons, and so on, which makes 3,160 comparisons, or 3,160 sec. Similarly, image81 will compare itself with images 82-160, and so on. All the "group comparisons" finish in 3,160 sec because they're run in parallel.
The second part of the algorithm will compare group 1 elements with group 2 elements, group 2 with group 3, group 3 with group 4, and so on. This means image1 will be compared with images 81-160, which is 80 comparisons, so the total number of comparisons between group 1 and group 2 is 80*80 = 6,400. Is it possible to have each image's comparisons run in parallel with the group comparisons? That is, if image1 is comparing itself with images 81-160, then image2 should do the same, and so on, while the other groups are doing the same. Then this part should take only 6,400 sec.
Now, group1 will be compared with group3, group2 with group4, group3 with group5. ->6400 sec
After which, group1 will be compared with group4 and group2 with group5. ->6400 sec
So all groups are compared.
Total time = 3,160 + 6,400 + 6,400 + 6,400 = 22,360 sec. I realize that the more groups there are, the more time it takes, so I'd have to increase the group size to limit that growth. Either way, it cuts the time down to almost a quarter of the original.
Is this algorithm unrealistic? If so, why? What are its flaws? How would I fix them? Is there a better algorithm to compare a list of images faster? Obviously not quick sort; I can't arrange the images in ascending or descending order. Or can I?
If this algorithm is possible, what would be the best way to implement it: Thread or AsyncTask?
This is a realistic algorithm, but ideally you'll want to be able to use the same number of worker threads throughout the program. For this you'll need to use an even number of threads, say 8.
On Pass1, Thread1 processes images 1-50, Thread2 processes images 51-100, etc.
On Pass2, Thread1 and Thread2 both process images 1-100: Thread1 compares images 1-25 against 51-75 while Thread2 compares images 26-50 against 76-100; then Thread1 compares images 1-25 against 76-100 while Thread2 compares images 26-50 against 51-75.
Passes 3 through 8 follow the same pattern - the two threads assigned to the two groups being processed split up the groups between them. This way you keep all of your threads busy. However, you'll need an even number of threads for this in order to simplify group partitioning.
Sample code for 4 threads
class ImageGroup {
    final int index1; // start index, inclusive
    final int index2; // end index, exclusive

    ImageGroup(int index1, int index2) {
        this.index1 = index1;
        this.index2 = index2;
    }
}
class ImageComparer implements Runnable {
final Image[] images;
ImageGroup group1;
ImageGroup group2;
public ImageComparer(Image[] images, ImageGroup group1, ImageGroup group2) {
    this.images = images;
    this.group1 = group1;
    this.group2 = group2;
}
public void run() {
if(group2 == null) { // Compare images within a single group
for(int i = group1.index1; i < group1.index2; i++) {
for(int j = i + 1; j < group1.index2; j++) {
compare(images[i], images[j]);
}
}
} else { // Compare images between two groups
for(int i = group1.index1; i < group1.index2; i++) {
for(int j = group2.index1; j < group2.index2; j++) {
compare(images[i], images[j]);
}
}
}
}
}
ExecutorService executor = Executors.newFixedThreadPool(4); // use a pool size equal to the number of target threads
// for 4 threads we need 8 image groups
ImageGroup group1 = new ImageGroup(0, 50);
ImageGroup group2 = new ImageGroup(50, 100);
...
ImageGroup group8 = new ImageGroup(350, 400);
ImageComparer comparer1 = new ImageComparer(images, group1, null);
ImageComparer comparer2 = new ImageComparer(images, group3, null);
...
ImageComparer comparer4 = new ImageComparer(images, group7, null);
// submit comparers to executor service
Future future1 = executor.submit(comparer1);
Future future2 = executor.submit(comparer2);
Future future3 = executor.submit(comparer3);
Future future4 = executor.submit(comparer4);
// wait for threads to finish
future1.get();
future2.get();
future3.get();
future4.get();
comparer1 = new ImageComparer(images, group2, null);
...
comparer4 = new ImageComparer(images, group8, null);
// submit to executor, wait to finish
comparer1 = new ImageComparer(images, group1, group3);
...
comparer4 = new ImageComparer(images, group7, group6);
// submit to executor, wait to finish
comparer1 = new ImageComparer(images, group1, group4);
...
comparer4 = new ImageComparer(images, group7, group5);
I am iterating through a HashMap with ±20 million entries. In each iteration I am again iterating through a HashMap with ±20 million entries.
HashMap<String, BitSet> data_1 = new HashMap<String, BitSet>
HashMap<String, BitSet> data_2 = new HashMap<String, BitSet>
I am dividing data_1 into chunks based on the number of threads (threads = cores; I have a four-core processor).
My code takes more than 20 hours to execute, and that's without storing the results to a file.
1) If I want to store the results of each thread in a file without overlapping, how can I do that?
2) How can I make the following much faster?
3) How do I create the chunks dynamically, based on the number of cores?
int cores = Runtime.getRuntime().availableProcessors();
int threads = cores;
//Number of threads
int Chunks = data_1.size() / threads;
//I don't trust the chunks created by the line below, that's why I created chunk1, chunk2, chunk3, chunk4 separately and validated them.
Map<Integer, BitSet>[] Chunk= (Map<Integer, BitSet>[]) new HashMap<?,?>[threads];
4) How do I create threads using for loops? Is what I am doing correct?
ClassName thread1 = new ClassName(data2, chunk1);
ClassName thread2 = new ClassName(data2, chunk2);
ClassName thread3 = new ClassName(data2, chunk3);
ClassName thread4 = new ClassName(data2, chunk4);
thread1.start();
thread2.start();
thread3.start();
thread4.start();
thread1.join();
thread2.join();
thread3.join();
thread4.join();
Representation of My Code
public class ClassName extends Thread {
Integer nSimilarEntities = 30;
public void run() {
for (String kNonRepeater : data_1.keySet()) {
// Extract the feature vector
BitSet vFeaturesNonRepeater = data_1.get(kNonRepeater);
// Calculate the sum of 1s (L2 norm is the sqrt of this)
double nNormNonRepeater = Math.sqrt(vFeaturesNonRepeater.cardinality());
// Loop through the repeater set
double nMinSimilarity = 100;
int nMinSimIndex = 0;
// Maintain the list of top similar repeaters and the similarity values
long dpind = 0;
ArrayList<String> vSimilarKeys = new ArrayList<String>();
ArrayList<Double> vSimilarValues = new ArrayList<Double>();
for (String kRepeater : data_2.keySet()) {
// Status output at regular intervals
dpind++;
if (Math.floorMod(dpind, pct) == 0) {
System.out.println(dpind + " dot products (" + Math.round(dpind / pct) + "%) out of "
+ nNumSimilaritiesToCompute + " completed!");
}
// Calculate the norm of repeater, and the dot product
BitSet vFeaturesRepeater = data_2.get(kRepeater);
double nNormRepeater = Math.sqrt(vFeaturesRepeater.cardinality());
BitSet vTemp = (BitSet) vFeaturesNonRepeater.clone();
vTemp.and(vFeaturesRepeater);
double nCosineDistance = vTemp.cardinality() / (nNormNonRepeater * nNormRepeater);
// queue.add(new MyClass(kRepeater,kNonRepeater,nCosineDistance));
// if(queue.size() > YOUR_LIMIT)
// queue.remove();
// Don't bother if the similarity is 0, obviously
if ((vSimilarKeys.size() < nSimilarEntities) && (nCosineDistance > 0)) {
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
} else { // If there are more, keep only the best
// If this is better than the smallest distance, then remove the smallest
if (nCosineDistance > nMinSimilarity) {
// Remove the lowest similarity value
vSimilarKeys.remove(nMinSimIndex);
vSimilarValues.remove(nMinSimIndex);
// Add this one
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
// Refresh the index of lowest similarity value
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
}
} // End loop for maintaining list of similar entries
}// End iteration through repeaters
for (int i = 0; i < vSimilarValues.size(); i++) {
System.out.println(Thread.currentThread().getName() + kNonRepeater + "|" + vSimilarKeys.get(i) + "|" + vSimilarValues.get(i));
}
}
}
}
Finally, if not multithreading, are there any other approaches in Java to reduce the running time?
The computer works similarly to what you would do by hand (it processes more digits/bits at a time, but the problem is the same).
If you do addition, the time is proportional to the size of the number.
If you do multiplication or division, it's proportional to the square of the size of the number.
For the computer, the size is based on multiples of 32 or 64 significant bits, depending on the implementation.
I'd say this task is suitable for parallel streams. Don't hesitate to take a look at this concept if you have time. Parallel streams seamlessly use multithreading at full speed.
The top-level processing will look like this:
data_1.entrySet()
      .parallelStream()
      .flatMap(nonRepeaterEntry -> processOne(nonRepeaterEntry.getKey(), nonRepeaterEntry.getValue(), data_2))
      .forEach(System.out::println);
You should provide a processOne function with a prototype like this:
Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBits, Map<String, BitSet> data2);
It will return the prepared strings that you currently print to a file.
To build the stream inside, you can populate a List first and then turn it into a stream in the return statement:
return list.stream();
Even though the inner loop could also be written with streams, nesting parallel streams is discouraged - you already have enough parallelism at the outer level.
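For illustration, here is a minimal sketch of what processOne could look like, using the hypothetical names above; it condenses the similarity computation from the question and leaves out the top-30 selection:
static Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBits,
                                 Map<String, BitSet> data2) {
    // The norm of the non-repeater is computed once per outer element
    double normNonRepeater = Math.sqrt(nonRepeaterBits.cardinality());
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, BitSet> e : data2.entrySet()) {
        BitSet intersection = (BitSet) nonRepeaterBits.clone();
        intersection.and(e.getValue());
        double similarity = intersection.cardinality()
                / (normNonRepeater * Math.sqrt(e.getValue().cardinality()));
        if (similarity > 0) {
            lines.add(nonRepeaterKey + "|" + e.getKey() + "|" + similarity);
        }
    }
    return lines.stream();
}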
For your questions:
1) If I want to store the results of each thread in a file without overlapping, how can I do that?
Any logging framework (Logback, Log4j) can deal with it, and so can parallel streams. You can also store the prepared lines in a queue/array and print them from a separate thread. That takes a bit of care, though; ready-made solutions are easier and effectively do the same thing.
2) How can I make the following much faster?
Optimize and parallelize. In a normal situation you get somewhere between number_of_threads/1.5 and number_of_threads times faster processing, assuming hyper-threading is in play, but it depends on how much of the work is not-so-parallel and on the underlying implementations.
3) How do I create the chunks dynamically, based on the number of cores?
You don't have to. Make a list of tasks (one task per data_1 entry) and feed them to an executor service - each task is already big enough. You can use a FixedThreadPool with the number of threads as a parameter, and it will distribute the tasks evenly.
Note: you should create a task class, get a Future for each task from threadpool.submit, and at the end run a loop calling .get() on each Future. It will throttle the main thread down to the executor's processing speed, implicitly giving fork-join-like behaviour.
4) Direct thread creation is an outdated technique. It's recommended to use an executor service of some sort, parallel streams, etc. If you do it with loops, you need to create the list of chunks, then in a loop create a thread per chunk and add it to a list of threads, and in another loop join each thread of the list. A sketch of the executor approach follows.
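To make points 3 and 4 concrete, here is a rough sketch of the executor approach, reusing the hypothetical processOne from above; exception handling is kept minimal:
// One task per data_1 entry; the fixed pool spreads them across cores.
int threads = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(threads);
List<Future<?>> futures = new ArrayList<>();
for (Map.Entry<String, BitSet> entry : data_1.entrySet()) {
    futures.add(pool.submit(() ->
            processOne(entry.getKey(), entry.getValue(), data_2)
                    .forEach(System.out::println)));
}
for (Future<?> f : futures) {
    try {
        f.get(); // wait for completion; rethrows failures from the task
    } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
    }
}
pool.shutdown();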
Ad hoc optimizations:
1) Make a Repeater class that stores the key, the BitSet, and the cardinality. Preprocess your maps, turning the entries into Repeater instances and calculating the cardinality once (i.e. not on every inner loop run). It will save you 20M * (20M - 1) calls of .cardinality(). You still need to call it for the intersection.
2) Replace vSimilarKeys and vSimilarValues with a limited-size PriorityQueue of combined entries. It works faster for keeping the top 30 elements.
Take a look at this question for info about PriorityQueue:
Java PriorityQueue with fixed size
3) You can skip processing a non-repeater if its cardinality is already 0 - a bitwise and can never increase the resulting cardinality, and you filter out all 0-similarity values anyway.
4) You can skip (remove from the temporary list you create in optimization 1) every Repeater with zero cardinality. As in point 3, it will never produce anything fruitful. A sketch of optimizations 1 and 2 follows.
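Here is a rough sketch of optimizations 1 and 2, with hypothetical names; the natural ordering on the entry values keeps the lowest similarity at the head of the queue, so evicting it is cheap:
// Optimization 1: compute the norm once per repeater, not per inner loop run.
class Repeater {
    final String key;
    final BitSet bits;
    final double norm; // sqrt(cardinality), precomputed

    Repeater(String key, BitSet bits) {
        this.key = key;
        this.bits = bits;
        this.norm = Math.sqrt(bits.cardinality());
    }
}

// Optimization 2: a bounded min-heap instead of two parallel ArrayLists.
int limit = 30;
PriorityQueue<Map.Entry<String, Double>> top =
        new PriorityQueue<>(Map.Entry.comparingByValue());
// Inside the inner loop, where repeater and similarity are the current values:
if (similarity > 0) {
    top.offer(new AbstractMap.SimpleEntry<>(repeater.key, similarity));
    if (top.size() > limit) {
        top.poll(); // evict the current minimum; O(log 30) instead of a rescan
    }
}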
OK, so I did my research; there are plenty of questions here on thread synchronization, but none of them really hit the point. I am currently working in OpenCV: I get a frame from the camera containing vehicles, remove the background, and track the vehicles, but before I do this I do some pre-processing and post-processing, like removing noise with a blur. All of this runs in a single thread and works great. But here comes the issue: I now want to read number plates, and for this I need a higher-resolution frame, otherwise I will not detect a single plate per frame. But as soon as I increase my frame size I take a performance hit, and my thread slows down to the point that my program no longer qualifies as a real-time system.
So I thought of adding more threads to the scene, each specializing in one task.
Here is a list of my tasks:
// receives a frame from the camera
1. preprocess
// receives a frame from preprocess and removes the background
2. remove background
// receives a frame from the background remover and tracks the vehicles
3. postprocess
If I run the threads one by one, I'm thinking it will still be slow, so instead I thought of running the threads simultaneously. But the issue is that they use the same objects; declaring them volatile will mean threads waiting for the thread holding the lock to finish before they can use the object, which will mean a slow system again. So my question is: how can I run these threads simultaneously without having to wait for the others?
I have looked at a dozen of multithreading techniques in Java but finding it really hard to come up with a way of making this work.
So far I have looked at
1. Thread synchronization using the keyword volatile
2. Thread synchronization using the keyword synchronized
3. Multiple thread locks using a lock object
4. Using threadpools
5. Using the Countdown Latch
6. Wait and notify
7. Using semaphores (which seemed like a good idea).
Here is the code I want to break down into those threads
public void VideoProcessor()
{
videProcessorThread = new Thread(new Runnable()
{
@Override
public void run()
{
try{
int i = 0;
while (isPlaying() && isMainScreenONOFF()) {
camera.read(frame);
//set default and max frame speed
camera.set(Videoio.CAP_PROP_FPS, 25);
//get the frame speed, just in case setting it did not work
fps = camera.get(Videoio.CAP_PROP_FPS);
//if(frame.height() > imgHeight || frame.width() > imgWidth)
Imgproc.resize(frame, frame, frameSize);
//check if to convert or not
if(getblackAndWhite())
Imgproc.cvtColor(frame, frame, Imgproc.COLOR_RGB2GRAY);
imag = frame.clone();
if(rOI){
//in case the user adjusted the lines, we try to calculate their new sizes
adjustLinesPositionAndSize(xAxisSlider.getValue(), yAxisSlider.getValue());
//then we continue and draw the lines
if(!roadIdentified)
roadTypeIdentifier(getPointA1(), getPointA2());
}
viewClass.updateCarCounter(tracker.getCountAB(), tracker.getCountBA());
if (i == 0) {
// jFrame.setSize(FRAME_WIDTH, FRAME_HEIGHT);
diffFrame = new Mat(outbox.size(), CvType.CV_8UC1);
diffFrame = outbox.clone();
}
if (i == 1) {
diffFrame = new Mat(frame.size(), CvType.CV_8UC1);
removeBackground(frame, diffFrame, mBGSub, thresHold.getValue(), learningRate.getValue());
frame = diffFrame.clone();
array = detectionContours(diffFrame, maximumBlob.getValue(), minimumBlob.getValue());
Vector<VehicleTrack> detections = new Vector<>();
Iterator<Rect> it = array.iterator();
while (it.hasNext()) {
Rect obj = it.next();
int ObjectCenterX = (int) ((obj.tl().x + obj.br().x) / 2);
int ObjectCenterY = (int) ((obj.tl().y + obj.br().y) / 2);
//try counter
//add centroid and bounding rectangle
Point pt = new Point(ObjectCenterX, ObjectCenterY);
VehicleTrack track = new VehicleTrack(frame, pt, obj);
detections.add(track);
}
if (array.size() > 0) {
tracker.update(array, detections, imag);
Iterator<Rect> it3 = array.iterator();
while (it3.hasNext()) {
Rect obj = it3.next();
int ObjectCenterX = (int) ((obj.tl().x + obj.br().x) / 2);
int ObjectCenterY = (int) ((obj.tl().y + obj.br().y) / 2);
Point pt = null;
pt = new Point(ObjectCenterX, ObjectCenterY);
Imgproc.rectangle(imag, obj.br(), obj.tl(), new Scalar(0, 255, 0), 2);
Imgproc.circle(imag, pt, 1, new Scalar(0, 0, 255), 2);
//count and eliminate counted
tracker.removeCounted(tracker.tracks);
}
} else if (array.size() == 0) {
tracker.updateKalman(imag, detections);
}
}
i = 1;
//Convert Image and display to View
displayVideo();
}
//if an error occurs or the video finishes
Image image = new Image("/assets/eyeMain.png");
viewClass.updateMainImageView(image);
}catch(Exception e)
{
e.printStackTrace();
System.out.println("Video Stopped Unexpectedly");
}
//thread is done
}
});
videProcessorThread.start();
}
As no-one else has replied, I'll have a go.
You've already covered the main technical aspects in your questions (locking, synchronisation etc). Whichever way you look at it, there is no general solution to designing a multi-threaded system. If you have threads accessing the same objects you need to design your synchronisation and you can get threads blocking each other, slowing everything down.
The first thing to do is to do some performance profiling as there is no point making things run in parallel if they are not slowing things down.
That said, I think there are three approaches you could take in your case.
Have a single thread process each frame but have a pool of threads processing frames in parallel. If it takes a second to process a frame and you have 25fps you'd need at least 25 threads to keep up with the frame rate. You'd always be about a second behind real time but you should be able to keep up with the frame rate.
A typical way to implement this would be to put the incoming frames in a queue. You then have a pool of threads reading the latest frame from the queue and processing it. The downside of this design is that you can't guarantee in which order you would get the results of the processing so you might need to add some more logic around sorting the results.
The advantages are that:
There is very little contention, just around getting the frames off the queue, and that should be minimal
It is easy to tune and scale by adjusting the number of threads. It could even run on multiple machines, depending on how easy it is to move frames between machines
You avoid the overhead of creating a new thread each time, as each thread processes one frame after another
It is easy to monitor as you can just look at the size of the queue
Error handling can be implemented easily, e.g. use ActiveMQ to re-queue a frame if a thread crashes.
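Here is a minimal sketch of option 1, assuming a hypothetical processFrame method that runs the whole pre-process/remove-background/track pipeline on one frame:
// Bounded queue of frames: the capture thread produces, the pool consumes.
BlockingQueue<Mat> frames = new ArrayBlockingQueue<>(50);
int workers = 25; // roughly one thread per frame of lag; tune to your machine
ExecutorService pool = Executors.newFixedThreadPool(workers);

// Capture thread: read frames and queue copies of them.
new Thread(() -> {
    Mat frame = new Mat();
    while (camera.read(frame)) {
        frames.offer(frame.clone()); // drops frames when the queue is full
    }
}).start();

// Workers: each takes the next available frame and processes it fully.
for (int i = 0; i < workers; i++) {
    pool.submit(() -> {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                processFrame(frames.take()); // hypothetical pipeline method
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
}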
Run parts of your algorithm in parallel. The way you've written it (pre-process, process, post-process), I don't see that this is suitable, as you can't do the post-processing at the same time as the pre-processing. However, if you can express your algorithm in steps that can be run in parallel then it might work.
Try and run specific parts of your code in parallel. Looking at the code you posted, the iterators are the obvious choice. Is there any reason not to run the iterator loops in parallel? If you can, experiment with Java's parallel streams to see if they bring any performance gains.
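For example, building the detections in the posted code is independent work per rectangle, so it could be expressed as a parallel stream (drawing on the shared imag Mat, by contrast, would need synchronisation). This assumes array is a java.util.Collection of Rect:
// Each VehicleTrack is built from its own Rect, so this parallelizes cleanly.
List<VehicleTrack> detections = array.parallelStream()
        .map(obj -> {
            Point center = new Point((int) ((obj.tl().x + obj.br().x) / 2),
                                     (int) ((obj.tl().y + obj.br().y) / 2));
            return new VehicleTrack(frame, center, obj);
        })
        .collect(Collectors.toList());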
Personally I'd try option 1 first, as it's quick and simple.
Java doc of RecursiveAction mentions:
The following example illustrates some refinements and idioms that may lead to better performance: RecursiveActions need not be fully recursive, so long as they maintain the basic divide-and-conquer approach. Here is a class that sums the squares of each element of a double array, by subdividing out only the right-hand-sides of repeated divisions by two, and keeping track of them with a chain of next references. It uses a dynamic threshold based on method getSurplusQueuedTaskCount, but counterbalances potential excess partitioning by directly performing leaf actions on unstolen tasks rather than further subdividing.
The related code:
protected void compute() {
int l = lo;
int h = hi;
Applyer right = null;
while (h - l > 1 && getSurplusQueuedTaskCount() <= 3) {
int mid = (l + h) >>> 1;
right = new Applyer(array, mid, h, right);
right.fork();
h = mid;
}
double sum = atLeaf(l, h);
while (right != null) {
if (right.tryUnfork()) // directly calculate if not stolen
sum += right.atLeaf(right.lo, right.hi);
else {
right.join();
sum += right.result;
}
right = right.next;
}
result = sum;
}
I just wonder about the reasoning behind getSurplusQueuedTaskCount() <= 3.
Java doc of ForkJoinTask.getSurplusQueuedTaskCount() mentions:
Returns an estimate of how many more locally queued tasks are held by the current worker thread than there are other worker threads that might steal them. This value may be useful for heuristic decisions about whether to fork other tasks. In many usages of ForkJoinTasks, at steady state, each worker should aim to maintain a small constant surplus (for example, 3) of tasks, and to process computations locally if this threshold is exceeded.
Again, why do we have to process computations locally if this threshold is exceeded?
My guess:
getSurplusQueuedTaskCount() = number of locally queued tasks - number of other worker threads that might steal them (locally queued tasks)
getSurplusQueuedTaskCount() > 3 means the locally queued tasks outnumber the other worker threads by more than 3. Thus, the other worker threads are already busy enough that they won't be able to steal any newly created tasks, and the current thread should perform the calculation itself instead of subdividing it further (i.e. creating new tasks).
Is my guess correct?
I have a variable list of files in a directory, and I have different threads in Java to process them. The number of threads varies depending on the current processor:
int numberOfThreads=Runtime.getRuntime().availableProcessors();
File[] inputFilesArr=currentDirectory.listFiles();
How do I split the files uniformly across threads? If I do simple math like
int filesPerThread=inputFilesArr.length/numberOfThreads
then I might end up missing some files if inputFilesArr.length is not exactly divisible by numberOfThreads. What is an efficient way of doing this so that the partition and load across all the threads are uniform?
Here is another take on this problem:
Use Java's ThreadPoolExecutor. Here is an example.
It works on the thread pool principle: you need not create threads every time you need one; instead, a specified number of threads is created at the start and used from the pool.
The idea is to treat the processing of each file in the directory as an independent task, to be performed by some thread.
Now you submit all the tasks to the executor in a loop (this makes sure that no files are left out).
The executor will add all of these tasks to a queue, and at the same time it will pick threads from the pool and assign them tasks until all the threads are busy.
Remaining tasks wait until a thread becomes available, so configuring the pool size is vital here: you can have as many threads as there are files, or fewer.
Here I assume that the files are processed independently of each other and that no particular bunch of files needs to be handled by a single thread. A minimal sketch follows.
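This is a rough sketch of that idea; process(File) stands in for whatever per-file work you do:
// One Callable per file; invokeAll blocks until every file is handled.
ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
List<Callable<Void>> tasks = new ArrayList<>();
for (File f : inputFilesArr) {
    tasks.add(() -> {
        process(f); // hypothetical per-file processing
        return null;
    });
}
try {
    pool.invokeAll(tasks);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
} finally {
    pool.shutdown();
}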
You can use a round-robin algorithm for an even distribution. Here is a sketch (ProcessThread is a hypothetical worker thread with a work queue):
ProcessThread[] t = new ProcessThread[numberOfCores];
int i = 0;
for (File f : files) {
    t[i++ % t.length].queueForProcessing(f);
}
for (ProcessThread tt : t) {
    tt.join();
}
The producer-consumer pattern will solve this gracefully. Have one producer (the main thread) put all the files on a bounded blocking queue (see BlockingQueue). Then have a number of worker threads take a file from the queue and process it.
The work (rather than the files) will be uniformly distributed over the threads, since a thread that finishes one file simply asks for the next one to process. This avoids the possible problem of one thread being assigned only large files while other threads get only small files.
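A minimal sketch of that pattern, assuming a hypothetical process(File) method; a sentinel file acts as the poison pill that tells the workers to stop, and the surrounding method is assumed to declare throws InterruptedException:
// Bounded queue: the producer blocks when it gets too far ahead.
final BlockingQueue<File> queue = new ArrayBlockingQueue<>(64);
final File POISON = new File(""); // sentinel marking the end of input

int workers = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(workers);
for (int i = 0; i < workers; i++) {
    pool.submit(() -> {
        try {
            for (File f = queue.take(); f != POISON; f = queue.take()) {
                process(f); // hypothetical per-file work
            }
            queue.put(POISON); // pass the pill on so the other workers stop too
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
}

// Producer (the main thread): enqueue every file, then the pill.
for (File f : inputFilesArr) {
    queue.put(f);
}
queue.put(POISON);
pool.shutdown();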
You can try computing the range (start and end indices in inputFilesArr) of files per thread:
if (inputFilesArr.length < numberOfThreads)
numberOfThreads = inputFilesArr.length;
int[][] filesRangePerThread = getFilesRangePerThread(inputFilesArr.length, numberOfThreads);
and
private static int[][] getFilesRangePerThread(int filesCount, int threadsCount)
{
int[][] filesRangePerThread = new int[threadsCount][2];
if (threadsCount > 1)
{
float odtRangeIncrementFactor = (float) filesCount / threadsCount;
float lastEndIndexSet = odtRangeIncrementFactor - 1;
int rangeStartIndex = 0;
int rangeEndIndex = Math.round(lastEndIndexSet);
filesRangePerThread[0] = new int[] { rangeStartIndex, rangeEndIndex };
for (int processCounter = 1; processCounter < threadsCount; processCounter++)
{
rangeStartIndex = rangeEndIndex + 1;
lastEndIndexSet += odtRangeIncrementFactor;
rangeEndIndex = Math.round(lastEndIndexSet);
filesRangePerThread[processCounter] = new int[] { rangeStartIndex, rangeEndIndex };
}
}
else
{
filesRangePerThread[0] = new int[] { 0, filesCount - 1 };
}
return filesRangePerThread;
}
If you are dealing with I/O, multiple threads can work in parallel even with one processor, because while one thread is waiting on read(byte[]), the processor can run another thread.
Anyway, this is my solution:
int nThreads = 2;
File[] files = new File[9];
int filesPerThread = Math.max(1, files.length / nThreads); // guard against 0 when there are more threads than files
class Task extends Thread {
List<File> list = new ArrayList<>();
// implement run here
}
Task task = new Task();
List<Task> tasks = new ArrayList<>();
tasks.add(task);
for (int i = 0; i < files.length; i++) {
if (task.list.size() == filesPerThread && files.length - i >= filesPerThread) {
task = new Task();
tasks.add(task);
}
task.list.add(files[i]);
}
for(Task t : tasks) {
System.out.println(t.list.size());
}
prints 4 5
Note that it will create 3 threads if you have 3 files and 5 processors
I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
public void run()
{
//doSomething
}
}
Then I submit the runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when working in a thread, but I don't know what they could be, or what the difference with //doSomethingElse is in that respect.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out of context, but I will post it here anyway. For those who know the mean shift segmentation algorithm, this is the part of the code where the mean shift vector is calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.
All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
final int lap = i;
Runnable r = new Runnable() {
public void run() {
long start = System.currentTimeMillis();
//doSomething
long duration = System.currentTimeMillis() - start;
System.out.printf("Lap %d: %d ms%n", lap, duration);
}
};
executor.execute(r);
}
Do you notice any difference in the results?
I personally do not see any reason for this. Any program has at least one thread. All threads are equal. All threads are created by default with medium priority (5). So, the code should show the same performance in both the main application thread and any other thread that you open.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.
When you create a new thread you always have an overhead. If you have a small piece of code, you may experience performance loss.
Once you have more code (bigger tasks), you may get a performance improvement from your parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: deciding how small a task can be while parallelizing it is still worthwhile is a well-known topic in parallel computation :)
You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking when measuring performance it's easy to get mislead when measuring small pieces of work. I would be looking to get a run of at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one difference between the "no thread" and "threaded" cases is that you have gone from having one thread (as has been pointed out, you always have a thread) to two threads, so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.