I am writing a small Java application to analyze a large number of image files. For now, it finds the brightest image in a folder by averaging the brightness of every pixel in the image and comparing it to the other images in the folder.
Sometimes, I get a rate of 100+ images/second right after startup, but this almost always drops to < 20 images/second, and I'm not sure why. When it is at 100+ images/sec, the CPU usage is 100%, but then it drops to around 20%, which seems too low.
Here's the main class:
public class ImageAnalysis {
public static final ConcurrentLinkedQueue<File> queue = new ConcurrentLinkedQueue<>();
private static final ConcurrentLinkedQueue<ImageResult> results = new ConcurrentLinkedQueue<>();
private static int size;
private static AtomicInteger running = new AtomicInteger();
private static AtomicInteger completed = new AtomicInteger();
private static long lastPrint = 0;
private static int completedAtLastPrint;
public static void main(String[] args){
File rio = new File(IO.CAPTURES_DIRECTORY.getAbsolutePath() + File.separator + "Rio de Janeiro");
String month = "12";
Collections.addAll(queue, rio.listFiles((dir, name) -> {
return (name.substring(0, 2).equals(month));
}));
size = queue.size();
ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1);
for (int i = 0; i < 8; i++){
AnalysisThread t = new AnalysisThread();
t.setPriority(Thread.MAX_PRIORITY);
executor.execute(t);
running.incrementAndGet();
}
}
public synchronized static void finished(){
if (running.decrementAndGet() <= 0){
ImageResult max = new ImageResult(null, 0);
for (ImageResult r : results){
if (r.averageBrightness > max.averageBrightness){
max = r;
}
}
System.out.println("Max Red: " + max.averageBrightness + " File: " + max.file.getAbsolutePath());
}
}
public synchronized static void finishedImage(ImageResult result){
results.add(result);
int c = completed.incrementAndGet();
if (System.currentTimeMillis() - lastPrint > 10000){
System.out.println("Completed: " + c + " / " + size + " = " + ((double) c / (double) size) * 100 + "%");
System.out.println("Rate: " + ((double) c - (double) completedAtLastPrint) / 10D + " images / sec");
completedAtLastPrint = c;
lastPrint = System.currentTimeMillis();
}
}
}
And the thread class:
public class AnalysisThread extends Thread {
@Override
public void run() {
while(!ImageAnalysis.queue.isEmpty()) {
File f = ImageAnalysis.queue.poll();
BufferedImage image;
try {
image = ImageIO.read(f);
double color = 0;
for (int x = 0; x < image.getWidth(); x++) {
for (int y = 0; y < image.getHeight(); y++) {
//Color c = new Color(image.getRGB(x, y));
color += image.getRGB(x,y);
}
}
color /= (image.getWidth() * image.getHeight());
ImageAnalysis.finishedImage((new ImageResult(f, color)));
} catch (IOException e) {
e.printStackTrace();
}
}
ImageAnalysis.finished();
}
}
You appear to have mixed up using a thread pool and creating threads of your own. I suggest you use one or the other; in fact, I suggest you use only the fixed thread pool.
Most likely what is happening is that your threads are getting an exception which is being lost but which kills the task, and that kills the thread.
I suggest you use just the thread pool; don't attempt to create your own threads or queue, as that is what the ExecutorService does for you. Submit one task per image to the pool, and if you are not going to check the outcome of each task, I suggest you trap all Throwable and log it, otherwise you could get a RuntimeException or Error and have no idea it happened.
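A minimal sketch of that approach, reusing the names from the question (ImageResult and the brightness loop are taken from the original code; the filter and pool size are illustrative):
ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Future<ImageResult>> futures = new ArrayList<>();
for (File f : rio.listFiles((dir, name) -> name.startsWith(month))) {
    futures.add(executor.submit(() -> {
        try {
            BufferedImage image = ImageIO.read(f);
            double color = 0;
            for (int x = 0; x < image.getWidth(); x++) {
                for (int y = 0; y < image.getHeight(); y++) {
                    color += image.getRGB(x, y);
                }
            }
            return new ImageResult(f, color / (image.getWidth() * image.getHeight()));
        } catch (Throwable t) {
            t.printStackTrace(); // log it so a failed task is not silently swallowed
            return null;
        }
    }));
}
executor.shutdown();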
If you have Java 8, a simpler approach would be to use parallelStream(). You can use this to analyse the images concurrently and collect the results without having to divide up the work and gather the results yourself, e.g.:
List<ImageResult> results = Stream.of(rio.listFiles())
.parallel()
.filter(f -> checkFile(f))
.map(f -> getResultsFor(f))
    .collect(Collectors.toList());
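For completeness, a rough sketch of what the two helpers used above might look like; checkFile and getResultsFor are placeholders (not part of the question's code), and the brightness loop just mirrors the one in AnalysisThread:
static boolean checkFile(File f) {
    // same month filter as the original listFiles lambda
    return f.getName().substring(0, 2).equals("12");
}

static ImageResult getResultsFor(File f) {
    try {
        BufferedImage image = ImageIO.read(f);
        double color = 0;
        for (int x = 0; x < image.getWidth(); x++) {
            for (int y = 0; y < image.getHeight(); y++) {
                color += image.getRGB(x, y);
            }
        }
        return new ImageResult(f, color / (image.getWidth() * image.getHeight()));
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}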
I see two reasons why you may experience CPU usage deterioration:
your tasks are very I/O intensive (reading images - ImageIO.read(f));
there is thread contention over the synchronized method that your threads access;
Further the sizes of the images may influence execution times.
To exploit parallelism efficiently I would suggest that you redesign your app and implement two kinds of tasks to be submitted to the executor:
the first tasks (producers) would be I/O intensive and will read the image data and queue it for in-memory processing;
the other (consumers) will pull and analyze the image information;
Then, with some profiling, you will be able to determine the correct ratio between producers and consumers; a rough sketch of the split follows.
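A sketch of that split, assuming a hypothetical ImageTask holder that carries the decoded BufferedImage from the I/O tasks to the CPU tasks (shutdown/poison-pill handling is omitted):
class ImageTask { // hypothetical holder: source file plus decoded pixels
    final File file;
    final BufferedImage image;
    ImageTask(File file, BufferedImage image) { this.file = file; this.image = image; }
}

BlockingQueue<ImageTask> decoded = new ArrayBlockingQueue<>(16); // small bounded buffer between the two stages

Runnable producer = () -> { // I/O intensive: read and decode only
    File f;
    while ((f = ImageAnalysis.queue.poll()) != null) {
        try {
            decoded.put(new ImageTask(f, ImageIO.read(f)));
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
    }
};

Runnable consumer = () -> { // CPU intensive: average the pixels of already-decoded images
    try {
        while (true) {
            ImageTask task = decoded.take();
            // ...compute the average brightness of task.image and record an ImageResult...
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
};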
The problem I see here is the use of queues in the high-performance concurrency model you are looking for. Using a queue is not optimal on a modern CPU design. Queue implementations have write contention on the head, tail and size variables. They are either always close to full or close to empty due to the difference in pace between consumers and producers, especially in a high-I/O situation. This results in high levels of contention. Further, in Java, queues are a significant source of garbage.
What I suggest is to apply Mechanical Sympathy when designing your code. One of the best options is the LMAX Disruptor, a high-performance inter-thread messaging library aimed at exactly this concurrency problem.
Additional References
http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf
http://martinfowler.com/articles/lmax.html
https://dzone.com/articles/mechanical-sympathy
http://www.infoq.com/presentations/mechanical-sympathy
Related
I am experimenting with multithreading in Java, more specifically, threadpools. As a test, I have written an application that simply changes the color of an image using multithreading for speed. However, for some reason unknown to me, I get corrupted results depending on how I set up this test. Below I describe how the test application works together with the complete source code.
Any help is very welcome! Thank you!
The Test Application
I have a 400x300 pixel image buffer that is initialized with the dark blue color, as shown below:
The program must fill it up completely with the red color.
Although I could simply loop over all pixels, coloring each one sequentially with red, I've decided, for performance, to take advantage of parallelism. Thus, I've decided to fill each image row with a separate thread. Since the number of rows (300) is much larger than the number of available CPU cores, I've created a threadpool (containing 4 threads) that will consume 300 tasks (each one in charge of filling up one row).
The program is organized as follows:
RGB class: holds the pixel color in a 3-tuple of doubles.
RenderTask class: fills up a given row of the image buffer with the red color.
Renderer class:
creates the image buffer.
creates the threadpool with "newFixedThreadPool".
creates 300 tasks to be consumed by the threadpool.
finishes the threadpool service.
writes the image buffer to a PPM file.
Below you may find the complete source code (I will call this code Version 1):
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;
import java.io.*;
class RGB {
RGB() {}
RGB(double r, double g, double b) {
this.r = r;
this.g = g;
this.b = b;
}
double r;
double g;
double b;
}
class RenderTask implements Runnable {
RenderTask(RGB[][] image_buffer, int row_width, int current_row) {
this.image_buffer = image_buffer;
this.row_width = row_width;
this.current_row = current_row;
}
@Override
public void run() {
for(int column = 0; column < row_width; ++column) {
image_buffer[current_row][column] = new RGB(1.0, 0.0, 0.0);
}
}
RGB[][] image_buffer;
int row_width;
int current_row;
}
public class Renderer {
public static void main(String[] str) {
int image_width = 400;
int image_height = 300;
// Creates a 400x300 pixel image buffer, where each pixel is RGB triple of doubles,
// and initializes the image buffer with a dark blue color.
RGB[][] image_buffer = new RGB[image_height][image_width];
for(int row = 0; row < image_height; ++row)
for(int column = 0; column < image_width; ++column)
image_buffer[row][column] = new RGB(0.0, 0.0, 0.2); // dark blue
// Creates a threadpool containing four threads
ExecutorService executor_service = Executors.newFixedThreadPool(4);
// Creates 300 tasks to be consumed by the threadpool:
// Each task will be in charge of filling one line of the image buffer.
for(int row = 0; row < image_height; ++row)
executor_service.submit(new RenderTask(image_buffer, image_width, row));
executor_service.shutdown();
// Saves the image buffer to a PPM file in ASCII format
try (FileWriter fwriter = new FileWriter("image.ppm");
BufferedWriter bwriter = new BufferedWriter(fwriter)) {
bwriter.write("P3\n" + image_width + " " + image_height + "\n" + 255 + "\n");
for(int row = 0; row < image_height; ++row)
for(int column = 0; column < image_width; ++column) {
int r = (int) (image_buffer[row][column].r * 255.0);
int g = (int) (image_buffer[row][column].g * 255.0);
int b = (int) (image_buffer[row][column].b * 255.0);
bwriter.write(r + " " + g + " " + b + " ");
}
} catch (IOException e) {
System.err.format("IOException: %s%n", e);
}
}
}
Everything seems to be working with that code, and I get the expected red image buffer, as shown below:
The Problem
However, if I modify the RenderTask.run() method so that it redundantly re-sets the color of the same buffer position several times in sequence, as shown below (I will call this one Version 2):
@Override
public void run() {
for(int column = 0; column < row_width; ++column) {
for(int s = 0; s < 256; ++s) {
image_buffer[current_row][column] = new RGB(1.0, 0.0, 0.0);
}
}
}
Then I get the following corrupted image buffer:
Actually, the result is different each time I run the program, but always corrupted.
As far as I understand it, no two threads are writing to the same memory position simultaneously, so there seems to be no race condition in sight.
Even in the case of "false sharing", which I don't think is happening, I would expect only lower performance, not corrupted results.
Thus, even with redundant assignments, I would expect to get the correct result (i.e. a completely red image buffer).
So, my questions are: Why is this happening to the Version 2 of the program if the only difference with respect to Version 1 is that the assignment operation is being executed redundantly within the scope of the thread?
Could it be that some threads are being destroyed before they finish? Could it be a bug in the JVM?
Or have I missed something trivial? (the strongest hypothesis :)
Thank you guys!!
ExecutorService.shutdown() does not await termination of the tasks it has, it only stops accepting new tasks.
After you have called shutdown you should call awaitTermination on the executor service if you want to wait for it to finish.
So what is happening is that not all of the tasks have finished executing when you start to write the image to file.
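In the Renderer above that means something like this right after submitting the tasks (a sketch; the timeout is arbitrary and java.util.concurrent.TimeUnit needs to be imported):
executor_service.shutdown();
try {
    // block until all 300 row tasks have completed (or the timeout expires)
    if (!executor_service.awaitTermination(1, TimeUnit.MINUTES)) {
        System.err.println("Rendering did not finish in time");
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
// only now is it safe to write image_buffer to the PPM file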
@emil is correct. To add to that answer, you can use the following code to shut down your thread pool:
The following method shuts down an ExecutorService in two phases, first by calling shutdown to reject incoming tasks, and then calling shutdownNow, if necessary, to cancel any lingering tasks:
void shutdownAndAwaitTermination(ExecutorService pool) {
pool.shutdown(); // Disable new tasks from being submitted
try {
// Wait a while for existing tasks to terminate
if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
pool.shutdownNow(); // Cancel currently executing tasks
// Wait a while for tasks to respond to being cancelled
if (!pool.awaitTermination(60, TimeUnit.SECONDS))
System.err.println("Pool did not terminate");
}
} catch (InterruptedException ie) {
// (Re-)Cancel if current thread also interrupted
pool.shutdownNow();
// Preserve interrupt status
Thread.currentThread().interrupt();
}
}
source: https://docs.oracle.com/en/java/javase/13/docs/api/java.base/java/util/concurrent/ExecutorService.html
I managed to calculate a factorial with two threads, without a pool. I have two factorial classes, named Factorial1 and Factorial2, which extend the Thread class. Let's say I want to calculate the value of !160000. In Factorial1's run() method I do the multiplication in a for loop from i=2 to i=80000, and in Factorial2's from i=80001 to 160000. After that, I return both values and multiply them in the main method. When I compare the execution times, the two-thread version is much faster (about 5000 milliseconds) than the non-threaded calculation (about 15000 milliseconds).
Now I want to write cleaner and better code, because I have seen how effective threads are for factorial calculation, but when I use a thread pool to calculate the factorial value, the parallel calculation always takes more time than the non-threaded calculation (nearly 16000 milliseconds). My code looks like:
for(int i=2; i<= Calculate; i++)
{
myPool.execute(new Multiplication(result, i));
}
run() method which is in Multiplication class:
public void run()
{
s1.Mltply(s2); // s1 and s2 are instances of my Number class
// their fields holds BigInteger values
}
Mltply() method which is in Number class:
public void Multiply(int number)
{
area.lock(); // result is going wrong without lock
Number temp = new Number(number);
value = value.multiply(temp.value); // value is a BigInteger
area.unlock();
}
In my opinion this lock may kill all the advantage of using threads, because it seems like all those threads do is the multiplication and nothing else. But without it, I can't even calculate the correct result. Let's say I want to calculate !10, so thread1 calculates 10*9*8*7*6 and thread2 calculates 5*4*3*2*1. Is that the approach I'm looking for? Is it even possible with a thread pool? Of course the execution time must be less than the normal calculation...
I appreciate all your help and suggestion.
EDIT: - My own solution to the problem -
public class MyMultiplication implements Runnable
{
public static BigInteger subResult1;
public static BigInteger subResult2;
int thread1StopsAt;
int thread2StopsAt;
long threadId;
static boolean idIsSet=false;
public MyMultiplication(BigInteger n1, int n2) // First Thread
{
MyMultiplication.subResult1 = n1;
this.thread1StopsAt = n2/2;
thread2StopsAt = n2;
}
public MyMultiplication(int n2,BigInteger n1) // Second Thread
{
MyMultiplication.subResult2 = n1;
this.thread2StopsAt = n2;
thread1StopsAt = n2/2;
}
@Override
public void run()
{
if(idIsSet==false)
{
threadId = Thread.currentThread().getId();
idIsSet=true;
}
if(Thread.currentThread().getId() == threadId)
{
for(int i=2; i<=thread1StopsAt; i++)
{
subResult1 = subResult1.multiply(BigInteger.valueOf(i));
}
}
else
{
for(int i=thread1StopsAt+1; i<= thread2StopsAt; i++)
{
subResult2 = subResult2.multiply(BigInteger.valueOf(i));
}
}
}
}
public class JavaApplication3
{
public static void main(String[] args) throws InterruptedException
{
int calculate=160000;
long start = System.nanoTime();
BigInteger num = BigInteger.valueOf(1);
for (int i = 2; i <= calculate; i++)
{
num = num.multiply(BigInteger.valueOf(i));
}
long end = System.nanoTime();
double time = (end-start)/1000000.0;
System.out.println("Without threads: \t" +
String.format("%.2f",time) + " miliseconds");
System.out.println("without threads Result: " + num);
BigInteger num1 = BigInteger.valueOf(1);
BigInteger num2 = BigInteger.valueOf(1);
ExecutorService myPool = Executors.newFixedThreadPool(2);
start = System.nanoTime();
myPool.execute(new MyMultiplication(num1,calculate));
Thread.sleep(100);
myPool.execute(new MyMultiplication(calculate,num2));
myPool.shutdown();
while(!myPool.isTerminated()) {} // waiting threads to end
end = System.nanoTime();
time = (end-start)/1000000.0;
System.out.println("With threads: \t" +String.format("%.2f",time)
+ " miliseconds");
BigInteger result =
MyMultiplication.subResult1.
multiply(MyMultiplication.subResult2);
System.out.println("With threads Result: " + result);
System.out.println(MyMultiplication.subResult1);
System.out.println(MyMultiplication.subResult2);
}
}
input : !160000
Execution time without threads : 15000 milliseconds
Execution time with 2 threads : 4500 milliseconds
Thanks for ideas and suggestions.
You may calculate !160000 concurrently without using a lock by splitting 160000 into disjoint chunks, as you explained, e.g. into 2..80000 and 80001..160000.
But you may achieve this by using the Java Stream API:
BigInteger result = IntStream.rangeClosed(1, 160000).parallel()
.mapToObj(val -> BigInteger.valueOf(val))
.reduce(BigInteger.ONE, BigInteger::multiply);
It does exactly what you are trying to do. It splits the whole range into chunks, establishes a thread pool and computes the partial results. Afterwards it joins the partial results into a single result.
So why do you bother doing it by yourself? Just practicing clean coding?
On my real 4-core machine, computation in a for loop took 8 times longer than using a parallel stream.
Threads have to run independently in order to run fast. Many dependencies such as locks, synchronized parts of your code or certain system calls lead to sleeping threads which are waiting to access some resource.
In your case you should minimize the time a thread spends inside the lock. Maybe I am wrong, but it seems like you create a thread for each number. So for 1,000! you spawn 1,000 threads. All of them try to get the lock on area and cannot calculate anything, because one thread holds the lock and all the other threads have to wait until it is unlocked again. So the threads effectively run serially, which is at best as fast as your non-threaded example, plus the extra time for locking and unlocking, thread management and so on. Oh, and because of the CPU's context switching it gets even worse.
Your first attempt, splitting the factorial across two threads, is the better one. Each thread can calculate its own result, and only when they are done do the threads have to communicate with each other. So they are independent most of the time.
Now you have to generalize this solution. To reduce CPU context switching you only want as many threads as your CPU has cores (maybe a little less because of your OS). Every thread gets a range of numbers and calculates their product. Afterwards it locks the overall result and merges its own partial product into it.
This should improve the performance of your problem.
Update: You asked for additional advice:
You said you have two classes, Factorial1 and Factorial2. Probably they have their ranges hard-coded. You only need one class which takes the range as constructor arguments. This class implements Runnable, so it has a run() method which multiplies all values in that range.
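A sketch of that class could look like this (the range is start inclusive, end exclusive, to match the main method below; the merge into the overall result uses the Main.lock/Main.result fields described further down):
class Factorial implements Runnable {
    private final int start; // first value of the range, inclusive
    private final int end;   // last value of the range, exclusive

    Factorial(int start, int end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public void run() {
        BigInteger value = BigInteger.ONE;
        for (int i = start; i < end; i++) {
            value = value.multiply(BigInteger.valueOf(i));
        }
        Main.lock.lock();
        try {
            Main.result = Main.result.multiply(value);
        } finally {
            Main.lock.unlock();
        }
    }
}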
In your main method you can do something like this:
int n = 160_000;
int threads = 2;
ExecutorService executor = Executors.newFixedThreadPool(threads);
for (int i = 0; i < threads; i++) {
int start = i * (n/threads) + 1;
int end = (i + 1) * (n/threads) + 1;
executor.execute(new Factorial(start, end));
}
executor.shutdown();
executor.awaitTermination(1, TimeUnit.DAYS);
Now you have calculated the result of each thread, but not the overall result. This can be solved by a BigInteger which is visible to the Factorial class (like a static BigInteger result; in the same main class) plus a lock. In the run method of Factorial you can then compute the overall result by taking the lock and multiplying in the partial result:
Main.lock.lock();
Main.result = Main.result.multiply(value);
Main.lock.unlock();
Some additional advice for the future: this isn't really clean, because Factorial needs to know about your main class, so it has a dependency on it. But ExecutorService can return a Future<T> object which can be used to receive the result of a task. Using such Future objects you don't need locks at all. This needs some extra work, though, so just try to get the lock-based version running for now ;-)
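For reference, a hedged sketch of that Future-based variant (no shared result and no lock; the chunking mirrors the loop above, and the surrounding method would declare the checked exceptions thrown by get()):
ExecutorService executor = Executors.newFixedThreadPool(threads);
List<Future<BigInteger>> partials = new ArrayList<>();
for (int i = 0; i < threads; i++) {
    final int start = i * (n / threads) + 1;
    final int end = (i + 1) * (n / threads) + 1;
    partials.add(executor.submit(() -> {
        BigInteger value = BigInteger.ONE;
        for (int v = start; v < end; v++) {
            value = value.multiply(BigInteger.valueOf(v));
        }
        return value; // each task returns its partial product
    }));
}
BigInteger result = BigInteger.ONE;
for (Future<BigInteger> partial : partials) {
    result = result.multiply(partial.get()); // blocks until that task is done
}
executor.shutdown();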
In addition to my Java Stream API solution, here is another solution which uses a self-managed thread pool, as you asked for. It relies on static imports of Executors.newCachedThreadPool, IntStream.rangeClosed and Collectors.toList:
public static final int CHUNK_SIZE = 10000;
public static BigInteger fac(int max) {
ExecutorService executor = newCachedThreadPool();
try {
return rangeClosed(0, (max - 1) / CHUNK_SIZE)
    .mapToObj(val -> executor.submit(() -> prod(leftBound(val), rightBound(val, max))))
    .collect(toList()) // materialize the futures so every chunk is submitted before any of them is awaited
    .stream()
    .map(future -> valueOf(future))
    .reduce(BigInteger.ONE, BigInteger::multiply);
} finally {
executor.shutdown();
}
}
private static int leftBound(int chunkNo) {
return chunkNo * CHUNK_SIZE + 1;
}
private static int rightBound(int chunkNo, int max) {
return Math.min((chunkNo + 1) * CHUNK_SIZE, max);
}
private static BigInteger valueOf(Future<BigInteger> future) {
try {
return future.get();
} catch (Exception e) {
throw new RuntimeException(e);
}
}
private static BigInteger prod(int min, int max) {
BigInteger res = BigInteger.valueOf(min);
for (int val = min + 1; val <= max; val++) {
res = res.multiply(BigInteger.valueOf(val));
}
return res;
}
I need to factor a 64-bit number (n = pq).
So I implemented a method which sequentially checks all numbers in the range [1; sqrt(n)].
It took 27 seconds to execute on an Android device with a 1.2 GHz processor (unfortunately, I don't know the number of CPU cores). So I decided to make it parallel. Well, two Runnables give me the result in 51 seconds and three in 83.
My program does nothing but call this method in onCreate.
final static private int WORKERS_COUNT = 3;
final static public int[] pqFactor(final long pq) {
stopFactorFlag = false;
long blockSize = (long)Math.ceil(Math.sqrt(pq) / WORKERS_COUNT);
ExecutorService executor = Executors.newFixedThreadPool(WORKERS_COUNT);
for (int workerIdx = 0; workerIdx < WORKERS_COUNT; ++workerIdx) {
Runnable worker = new FactorTask(pq, workerIdx * blockSize, (workerIdx + 1) * blockSize);
executor.execute(worker);
}
executor.shutdown();
try {
executor.awaitTermination(5, TimeUnit.MINUTES);
} catch (InterruptedException e) {
e.printStackTrace();
}
return result;
}
private static boolean stopFactorFlag;
private static int p, q;
static private class FactorTask implements Runnable {
final private long pq;
private long leftBorder;
private long rightBorder;
public long pInternal;
public long qInternal;
/* Constructor was there */
@Override
public void run() {
for (qInternal = rightBorder; !stopFactorFlag && qInternal > leftBorder && qInternal > 1L; qInternal -= 2L) {
if (pq % qInternal == 0L) {
pInternal = pq / qInternal;
p = (int)pInternal;
q = (int)qInternal;
stopFactorFlag = true;
break;
}
}
}
}
P.S. This is not homework, I really need this. Maybe there is another way entirely.
Executing 2 or more Runnables causes performance issues
It looks to me like your Android device has either 1 or 2 cores, and adding threads to your problem is not going to make it run faster because you have exhausted your CPU resources. I'd recommend looking up your device specs to determine how many cores it has.
If I run your code under my 4 core MacBook Pro:
2 threads in ~6secs
3 threads in ~4secs
4 threads in ~3.5secs
This seems reasonably linear (taking into account startup/shutdown overhead) and indicates to me that it is not the code that is holding you back.
Btw, the stopFactorFlag should be volatile. Also I don't see how you are creating your result array but I'm worried about the race conditions there.
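For instance, the shared fields could be declared roughly like this (the result holder below is purely hypothetical, since the question never shows how result is created):
private static volatile boolean stopFactorFlag; // volatile so workers see the stop signal promptly
private static final AtomicReference<int[]> result = new AtomicReference<>(); // hypothetical holder for {p, q}

// inside FactorTask.run(), when a factor is found:
// result.set(new int[] { (int) pInternal, (int) qInternal });
// stopFactorFlag = true;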
I have a variable list of files in a directory and I have different threads in Java to process them. The number of threads varies depending on the current processor:
int numberOfThreads=Runtime.getRuntime().availableProcessors();
File[] inputFilesArr=currentDirectory.listFiles();
How do I split the files uniformly across threads? If I do simple math like
int filesPerThread=inputFilesArr.length/numberOfThreads
then I might end up missing some files if the inputFilesArr.length and numberOfThreads are not exactly divisible by each other. What is an efficient way of doing this so that the partition and load across all the threads are uniform?
Here is another take on this problem:
Use Java's ThreadPoolExecutor.
It works on the thread pool principle: you do not create threads every time you need one; instead, a specified number of threads is created at the start and reused from the pool.
The idea is to treat the processing of each file in the directory as an independent task, to be performed by some thread.
You then submit all the tasks to the executor in a loop (this makes sure that no files are left out).
The executor adds all of these tasks to a queue and, at the same time, picks up threads from the pool and assigns them tasks until all the threads are busy.
It then waits until a thread becomes available. So configuring the thread pool size is vital here: you can have as many threads as there are files, or fewer.
Here I assume that each file can be processed independently and that it is not required for a certain bunch of files to be processed by a single thread; a minimal sketch follows.
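In code, that could look like this (processFile is a placeholder for whatever work each file needs):
ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
for (File file : inputFilesArr) {
    executor.submit(() -> processFile(file)); // one independent task per file
}
executor.shutdown();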
You can use a round-robin algorithm for an even distribution. Here is the idea in Java-flavoured pseudocode:
ProcessThread[] t = new ProcessThread[numberOfCores];
int i = 0;
for (File f : files) {
    t[i++ % t.length].queueForProcessing(f);
}
for (ProcessThread tt : t) {
    tt.join();
}
The producer-consumer pattern will solve this gracefully. Have one producer (the main thread) put all the files on a bounded blocking queue (see BlockingQueue). Then have a number of worker threads take a file from the queue and process it.
The work (rather than the files) will be uniformly distributed over threads, since threads that are done processing one file, come ask for the next file to process. This avoids the possible problem that one thread gets assigned only large files to process, and other threads get only small files to process.
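A compact sketch of that pattern (the per-file work is left as a placeholder, and the surrounding method would declare throws InterruptedException):
int numberOfThreads = Runtime.getRuntime().availableProcessors();
BlockingQueue<File> queue = new ArrayBlockingQueue<>(100); // bounded, so the producer cannot run far ahead
final File POISON = new File("");                          // sentinel telling a worker to stop

ExecutorService workers = Executors.newFixedThreadPool(numberOfThreads);
for (int i = 0; i < numberOfThreads; i++) {
    workers.submit(() -> {
        try {
            while (true) {
                File f = queue.take();   // blocks until a file (or the sentinel) is available
                if (f == POISON) break;
                // ...process f...
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
}

// producer: the main thread feeds the queue
for (File f : currentDirectory.listFiles()) {
    queue.put(f);
}
for (int i = 0; i < numberOfThreads; i++) {
    queue.put(POISON); // one poison pill per worker
}
workers.shutdown();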
You can try to compute the range (start and end indices into inputFilesArr) of files per thread:
if (inputFilesArr.length < numberOfThreads)
numberOfThreads = inputFilesArr.length;
int[][] filesRangePerThread = getFilesRangePerThread(inputFilesArr.length, numberOfThreads);
and
private static int[][] getFilesRangePerThread(int filesCount, int threadsCount)
{
int[][] filesRangePerThread = new int[threadsCount][2];
if (threadsCount > 1)
{
float odtRangeIncrementFactor = (float) filesCount / threadsCount;
float lastEndIndexSet = odtRangeIncrementFactor - 1;
int rangeStartIndex = 0;
int rangeEndIndex = Math.round(lastEndIndexSet);
filesRangePerThread[0] = new int[] { rangeStartIndex, rangeEndIndex };
for (int processCounter = 1; processCounter < threadsCount; processCounter++)
{
rangeStartIndex = rangeEndIndex + 1;
lastEndIndexSet += odtRangeIncrementFactor;
rangeEndIndex = Math.round(lastEndIndexSet);
filesRangePerThread[processCounter] = new int[] { rangeStartIndex, rangeEndIndex };
}
}
else
{
filesRangePerThread[0] = new int[] { 0, filesCount - 1 };
}
return filesRangePerThread;
}
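Each thread can then be handed its slice of the array, for example (a sketch; the per-file work is a placeholder):
ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
for (int[] range : filesRangePerThread) {
    final int from = range[0];
    final int to = range[1];   // both indices are inclusive
    executor.submit(() -> {
        for (int idx = from; idx <= to; idx++) {
            // ...process inputFilesArr[idx]...
        }
    });
}
executor.shutdown();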
If you are dealing with I/O, then even with one processor multiple threads can work in parallel, because while one thread is waiting on read(byte[]) the processor can run another thread.
Anyway, this is my solution:
int nThreads = 2;
File[] files = new File[9];
int filesPerThread = files.length / nThreads;
class Task extends Thread {
List<File> list = new ArrayList<>();
// implement run here
}
Task task = new Task();
List<Task> tasks = new ArrayList<>();
tasks.add(task);
for (int i = 0; i < files.length; i++) {
if (task.list.size() == filesPerThread && files.length - i >= filesPerThread) {
task = new Task();
tasks.add(task);
}
task.list.add(files[i]);
}
for(Task t : tasks) {
System.out.println(t.list.size());
}
prints 4 5
Note that it will create 3 threads if you have 3 files and 5 processors
I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
public void run()
{
//doSomething
}
}
Then I submit the Runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code does.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when run in a separate thread, but I don't know what those could be, or how //doSomething differs in that respect from //doSomethingElse.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out of context, but I will post it here anyway. For those who know the mean shift segmentation algorithm, this is the part of the code where the mean shift vector is calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The time is measured inside the code, so it excludes the startup of the thread.
All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
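As a rough sketch of what that could look like for the loop above (names follow the question; each task would need its own yk and Mh buffers, and access to modeTable/pointList would have to be made thread-safe, which is not shown):
int tasks = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(tasks);
List<Future<?>> futures = new ArrayList<>();
int chunk = (L + tasks - 1) / tasks;       // pixels per task, rounded up
for (int t = 0; t < tasks; t++) {
    final int from = t * chunk;
    final int to = Math.min(L, from + chunk);
    futures.add(executor.submit(() -> {
        for (int i = from; i < to; i++) {
            // ...mean shift for pixel i, as in the original outer loop...
        }
    }));
}
for (Future<?> f : futures) {
    f.get(); // wait for all chunks; the surrounding method declares the checked exceptions
}
executor.shutdown();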
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
final int lap = i;
Runnable r = new Runnable() {
public void run() {
long start = System.currentTimeMillis();
//doSomething
long duration = System.currentTimeMillis() - start;
System.out.printf("Lap %d: %d ms%n", lap, duration);
}
};
executor.execute(r);
}
Do you notice any difference in the results?
I personally do not see any reason for this. Any program has at least one thread. All threads are equal; all threads are created by default with normal priority (5). So the code should show the same performance in both the main application thread and any other thread that you start.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.
When you create a new thread you always incur some overhead. If you have a small piece of code, you may see a performance loss.
Once you have more code (bigger tasks), you may get a performance improvement from your parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: deciding how small a task can be while parallelizing it remains worthwhile is a well-known topic in parallel computation :)
You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking when measuring performance it's easy to get mislead when measuring small pieces of work. I would be looking to get a run of at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one difference between the "no thread" and "threaded" cases is that you have gone from having one thread (as has been pointed out, you always have a thread) to two threads, so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.