Multithreading Usage - Java

I am iterating through a HashMap with roughly 20 million entries. In each iteration I am again iterating through another HashMap with roughly 20 million entries.
HashMap<String, BitSet> data_1 = new HashMap<String, BitSet>();
HashMap<String, BitSet> data_2 = new HashMap<String, BitSet>();
I am dividing data_1 into chunks based on the number of threads (threads = cores; I have a four-core processor).
My code is taking more than 20 hours to execute, and that excludes the time to store the results in a file.
1) If I want to store the results of each thread into a file without them overlapping, how can I do that?
2) How can I make the following much faster?
3) How can I create the chunks dynamically, based on the number of cores?
int cores = Runtime.getRuntime().availableProcessors();
int threads = cores;
//Number of threads
int Chunks = data_1.size() / threads;
//I don't trust the chunks created by the line below; that's why I created chunk1, chunk2, chunk3 and chunk4 separately and validated them.
Map<Integer, BitSet>[] Chunk= (Map<Integer, BitSet>[]) new HashMap<?,?>[threads];
4) How can I create the threads using for loops? Is what I am doing below correct?
ClassName thread1 = new ClassName(data2, chunk1);
ClassName thread2 = new ClassName(data2, chunk2);
ClassName thread3 = new ClassName(data2, chunk3);
ClassName thread4 = new ClassName(data2, chunk4);
thread1.start();
thread2.start();
thread3.start();
thread4.start();
thread1.join();
thread2.join();
thread3.join();
thread4.join();
Representation of My Code
public class ClassName extends Thread {
Integer nSimilarEntities = 30;
public void run() {
for (String kNonRepeater : data_1.keySet()) {
// Extract the feature vector
BitSet vFeaturesNonRepeater = data_1.get(kNonRepeater);
// Calculate the sum of 1s (L2 norm is the sqrt of this)
double nNormNonRepeater = Math.sqrt(vFeaturesNonRepeater.cardinality());
// Loop through the repeater set
double nMinSimilarity = 100;
int nMinSimIndex = 0;
// Maintain the list of top similar repeaters and the similarity values
long dpind = 0;
ArrayList<String> vSimilarKeys = new ArrayList<String>();
ArrayList<Double> vSimilarValues = new ArrayList<Double>();
for (String kRepeater : data_2.keySet()) {
// Status output at regular intervals
dpind++;
if (Math.floorMod(dpind, pct) == 0) {
System.out.println(dpind + " dot products (" + Math.round(dpind / pct) + "%) out of "
+ nNumSimilaritiesToCompute + " completed!");
}
// Calculate the norm of repeater, and the dot product
BitSet vFeaturesRepeater = data_2.get(kRepeater);
double nNormRepeater = Math.sqrt(vFeaturesRepeater.cardinality());
BitSet vTemp = (BitSet) vFeaturesNonRepeater.clone();
vTemp.and(vFeaturesRepeater);
double nCosineDistance = vTemp.cardinality() / (nNormNonRepeater * nNormRepeater);
// queue.add(new MyClass(kRepeater,kNonRepeater,nCosineDistance));
// if(queue.size() > YOUR_LIMIT)
// queue.remove();
// Don't bother if the similarity is 0, obviously
if ((vSimilarKeys.size() < nSimilarEntities) && (nCosineDistance > 0)) {
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
} else { // If there are more, keep only the best
// If this is better than the smallest distance, then remove the smallest
if (nCosineDistance > nMinSimilarity) {
// Remove the lowest similarity value
vSimilarKeys.remove(nMinSimIndex);
vSimilarValues.remove(nMinSimIndex);
// Add this one
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
// Refresh the index of lowest similarity value
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
}
} // End loop for maintaining list of similar entries
}// End iteration through repeaters
for (int i = 0; i < vSimilarValues.size(); i++) {
System.out.println(Thread.currentThread().getName() + kNonRepeater + "|" + vSimilarKeys.get(i) + "|" + vSimilarValues.get(i));
}
}
}
}
Finally, if not multithreading, are there any other approaches in Java to reduce the running time?

The computer works similarly to what you would do by hand (it just processes more digits/bits at a time, but the problem is the same).
If you do addition, the time is proportional to the size of the number.
If you do multiplication or division, it's proportional to the square of the size of the number.
For the computer, the size is based on multiples of 32 or 64 significant bits, depending on the implementation.

I'd say this task is suitable for parallel streams. Don't hesitate to take a look at this concept if you have time. Parallel streams seamlessly use multithreading at full speed.
The top-level processing will look like this:
data_1.entrySet()
.parallelStream()
.flatMap(nonRepeaterEntry -> processOne(nonRepeaterEntry.getKey(), nonRepeaterEntry.getValue(), data2))
.forEach(System.out::println);
You should provide a processOne function with a prototype like this:
Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBitSet, Map<String, BitSet> data2);
It will return the prepared strings that you currently print to a file.
To build the stream inside, you can prepare a List first and then turn it into a stream in the return statement:
return list.stream();
Even though the inner loop could also be expressed with streams, nested parallel streaming is discouraged - you already have enough parallelism at the outer level.
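For illustration, a minimal sketch of processOne under the assumptions above (imports omitted; it keeps every non-zero similarity, and your top-30 bookkeeping would replace the simple if):
static Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBits, Map<String, BitSet> data2) {
    // Sketch only, not a drop-in implementation.
    double normNonRepeater = Math.sqrt(nonRepeaterBits.cardinality());
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, BitSet> e : data2.entrySet()) {
        BitSet tmp = (BitSet) nonRepeaterBits.clone();
        tmp.and(e.getValue());
        double similarity = tmp.cardinality() / (normNonRepeater * Math.sqrt(e.getValue().cardinality()));
        if (similarity > 0) {                      // top-30 selection would go here instead
            lines.add(nonRepeaterKey + "|" + e.getKey() + "|" + similarity);
        }
    }
    return lines.stream();
}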
For your questions:
1) If I want to store the results of each thread into a file without them overlapping, how can I do that?
Any logging framework (Logback, Log4j) can deal with it, and so can parallel streams. You can also store the prepared lines in a queue/array and print them from a separate thread. That takes a bit of care, though; the ready-made solutions are easier and effectively do the same thing.
2) How can I make the following much faster?
Optimize and parallelize. In a normal situation you get between number_of_threads/1.5 and number_of_threads times faster processing, assuming hyperthreading is in play, but it depends on the parts that are not so parallel and on the underlying implementations.
3) How can I create the chunks dynamically, based on the number of cores?
You don't have to. Make a list of tasks (one task per data_1 entry) and feed the executor service with them - each entry is already a big enough task. You can use a FixedThreadPool with the number of threads as a parameter, and it will distribute the tasks evenly.
Note that you should create a task class, get a Future for each task from threadPool.submit, and at the end run a loop calling .get() on each Future. That throttles the main thread down to the executor's processing speed, implicitly giving fork-join-like behaviour; see the sketch below.
4) Creating threads directly is an outdated technique. It's recommended to use an executor service of some sort, parallel streams, etc. If you do want a for loop: create a list of chunks, create a thread for each chunk in a loop and add it to a list of threads, and in another loop join each thread in the list.
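A rough sketch of the executor-based variant (the class and method names here are placeholders, not your actual code):
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SimilarityDriver {

    // Hypothetical helper: builds the "nonRepeater|repeater|similarity" lines for one entry.
    // Plug your existing inner-loop logic in here.
    static List<String> computeLines(String key, BitSet bits, Map<String, BitSet> data2) {
        return new ArrayList<>(); // placeholder body
    }

    static void runAll(Map<String, BitSet> data1, Map<String, BitSet> data2) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (Entry<String, BitSet> e : data1.entrySet()) {
            // one small task per non-repeater entry; the pool spreads them over the cores
            futures.add(pool.submit(() -> computeLines(e.getKey(), e.getValue(), data2)));
        }
        try (PrintWriter out = new PrintWriter("results.txt")) { // single writer, so lines never interleave
            for (Future<List<String>> f : futures) {
                for (String line : f.get()) { // get() waits for that task to finish
                    out.println(line);
                }
            }
        }
        pool.shutdown();
    }
}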
Ad hoc optimizations:
1) Make a Repeater class that stores the key, the bitset and the cardinality. Preprocess your maps, turning the entries into Repeater instances and calculating the cardinality once (i.e. not on every inner-loop run). That saves you about 20M * (20M - 1) calls to .cardinality(). You still need to call it on the and result.
2) Replace vSimilarKeys/vSimilarValues with a size-limited PriorityQueue of combined entries. It is faster for keeping the top 30 elements; see the sketch after this list.
Take a look at this question for info about a fixed-size PriorityQueue:
Java PriorityQueue with fixed size
3) You can skip processing a nonRepeater whose cardinality is already 0 - a bitwise and can never increase the resulting cardinality, and you filter out all zero-distance values anyway.
4) You can skip (remove from the temporary list you create in optimization 1) every Repeater with zero cardinality. As in point 3, it will never produce anything fruitful.
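A compact sketch of optimizations 1 and 2 (the class names Repeater, Match and TopMatches are illustrative, not from your code):
import java.util.BitSet;
import java.util.Comparator;
import java.util.PriorityQueue;

class Repeater {
    final String key;
    final BitSet bits;
    final double norm; // sqrt(cardinality), computed once up front

    Repeater(String key, BitSet bits) {
        this.key = key;
        this.bits = bits;
        this.norm = Math.sqrt(bits.cardinality());
    }
}

class Match {
    final String key;
    final double similarity;

    Match(String key, double similarity) {
        this.key = key;
        this.similarity = similarity;
    }
}

class TopMatches {
    // smallest similarity sits at the head, so the worst match is cheap to evict
    private final PriorityQueue<Match> heap = new PriorityQueue<>(Comparator.comparingDouble((Match m) -> m.similarity));
    private final int limit;

    TopMatches(int limit) {
        this.limit = limit;
    }

    void offer(String key, double similarity) {
        if (similarity <= 0) {
            return;                                // zero similarity never qualifies (optimizations 3 and 4)
        }
        if (heap.size() < limit) {
            heap.add(new Match(key, similarity));
        } else if (similarity > heap.peek().similarity) {
            heap.poll();                           // drop the current worst match
            heap.add(new Match(key, similarity));
        }
    }
}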

Related

Usage of ForkJoinTask.getSurplusQueuedTaskCount()

The Javadoc of RecursiveAction mentions:
The following example illustrates some refinements and idioms that may lead to better performance: RecursiveActions need not be fully recursive, so long as they maintain the basic divide-and-conquer approach. Here is a class that sums the squares of each element of a double array, by subdividing out only the right-hand-sides of repeated divisions by two, and keeping track of them with a chain of next references. It uses a dynamic threshold based on method getSurplusQueuedTaskCount, but counterbalances potential excess partitioning by directly performing leaf actions on unstolen tasks rather than further subdividing.
The related code:
protected void compute() {
    int l = lo;
    int h = hi;
    Applyer right = null;
    while (h - l > 1 && getSurplusQueuedTaskCount() <= 3) {
        int mid = (l + h) >>> 1;
        right = new Applyer(array, mid, h, right);
        right.fork();
        h = mid;
    }
    double sum = atLeaf(l, h);
    while (right != null) {
        if (right.tryUnfork()) // directly calculate if not stolen
            sum += right.atLeaf(right.lo, right.hi);
        else {
            right.join();
            sum += right.result;
        }
        right = right.next;
    }
    result = sum;
}
I just wonder about the reasoning behind getSurplusQueuedTaskCount() <= 3.
The Javadoc of ForkJoinTask.getSurplusQueuedTaskCount() mentions:
Returns an estimate of how many more locally queued tasks are held by the current worker thread than there are other worker threads that might steal them. This value may be useful for heuristic decisions about whether to fork other tasks. In many usages of ForkJoinTasks, at steady state, each worker should aim to maintain a small constant surplus (for example, 3) of tasks, and to process computations locally if this threshold is exceeded.
Again, why do we have to process computations locally if this threshold is exceeded?
My guess:
getSurplusQueuedTaskCount() = number of locally queued tasks - number of other worker threads that might steal them (locally queued tasks)
getSurplusQueuedTaskCount() > 3 means locally queued tasks outnumber other worker threads. Thus, other worker threads are already so busy that they won't be able to steal any newly created tasks. Thus, the current thread should perform the calculation instead of subdividing calculation (i.e. creating new task).
Is my guess correct?

Algorithm optimization - parallel AsyncTasks or threads?

I currently have a single AsyncTask which compares images pairwise, bubble-sort style, using OpenCV. Say I have to compare 400 images to each other. This would mean 400*401/2 = 80,200 comparisons. Let's assume one comparison takes 1 second. That's 80,200 sec, which is around 22.27 hours, which is ridiculously long. So I developed an algorithm of this type:
It divides the 400 images into 5 groups, so there are 80 images in each group.
The first part of the algorithm is the images comparing themselves within the group members.
So, image1 will compare itself with image2-80, which means there are 79 comparisons. image2 will have 78 comparisons and so on. Which makes 3,160 comparisons. Or 3,160 sec. Similarly, image81 will compare itself with image82-160 and so on. So all the "group comparisons" are finished in 3,160 sec because they're run in parallel.
The second part of the algorithm will compare group 1 elements with group 2 elements, group 2 with group 3, group 3 with group 4 and so on. This would mean image1 will be compared with image81-160, which is 80 comparisons and so total comparisons between group 1 and group 2 would be 80*80=6400 comparisons. Is it possible to have each image comparison in parallel with group comparisons? That is if image1 is comparing itself with image81-160 then image2 should do the same and so on, while the other groups are doing the same. So, this part should take only 6400 sec.
Now, group1 will be compared with group3, group2 with group4, group3 with group5. ->6400 sec
After which, group1 will be compared with group4 and group2 with group5. ->6400 sec
So all groups are compared.
Total time = 3,160 + 6,400 + 6,400 + 6,400 = 22,360 sec. I realize that the more groups there are, the more time it takes, so I'd have to increase the group size to limit the increase in time. Either way, it cuts the time down to almost a quarter of the original.
Is this algorithm unrealistic? If so, why? What are its flaws? How would I fix it? Is there a better algorithm to compare a list of images faster? Obviously not quicksort; I can't arrange the images in ascending or descending order. Or can I?
If this algorithm is possible, what would be the best way to implement it: Thread or AsyncTask?
This is a realistic algorithm, but ideally you'll want to be able to use the same number of worker threads throughout the program. For this you'll need to use an even number of threads, say 8.
On Pass1, Thread1 processes images 1-50, Thread2 processes images 51-100, etc.
On Pass2, Thread1 and Thread2 both process images 1-100: Thread1 compares images 1-25 against 51-75 while Thread2 compares images 26-50 against 76-100; then Thread1 compares images 1-25 against 76-100 and Thread2 compares images 26-50 against 51-75.
Passes 3 through 8 follow the same pattern - the two threads assigned to the two groups being processed split up the groups between them. This way you keep all of your threads busy. However, you'll need an even number of threads for this in order to simplify group partitioning.
Sample code for 4 threads
class ImageGroup {
    final int index1;
    final int index2;

    ImageGroup(int index1, int index2) {
        this.index1 = index1;
        this.index2 = index2;
    }
}
class ImageComparer implements Runnable {
    final Image[] images;
    ImageGroup group1;
    ImageGroup group2;

    public ImageComparer(Image[] images, ImageGroup group1, ImageGroup group2) {
        ...
    }

    public void run() {
        if(group2 == null) { // Compare images within a single group
            for(int i = group1.index1; i < group1.index2; i++) {
                for(int j = i + 1; j < group1.index2; j++) {
                    compare(images[i], images[j]);
                }
            }
        } else { // Compare images between two groups
            for(int i = group1.index1; i < group1.index2; i++) {
                for(int j = group2.index1; j < group2.index2; j++) {
                    compare(images[i], images[j]);
                }
            }
        }
    }
}
ExecutorService executor = Executors.newFixedThreadPool(4); // pool size equal to the number of target threads
// for 4 threads we need 8 image groups
ImageGroup group1 = new ImageGroup(0, 50);
ImageGroup group2 = new ImageGroup(50, 100);
...
ImageGroup group8 = new ImageGroup(350, 400);
ImageComparer comparer1 = new ImageComparer(images, group1, null);
ImageComparer comparer2 = new ImageComparer(images, group3, null);
...
ImageComparer comparer4 = new ImageComparer(images, group7, null);
// submit comparers to executor service
Future future1 = executor.submit(comparer1);
Future future2 = executor.submit(comparer2);
Future future3 = executor.submit(comparer3);
Future future4 = executor.submit(comparer4);
// wait for threads to finish
future1.get();
future2.get();
future3.get();
future4.get();
comparer1 = new ImageComparer(images, group2, null);
...
comparer4 = new ImageComparer(images, group8, null);
// submit to executor, wait to finish
comparer1 = new ImageComparer(images, group1, group3);
...
comparer4 = new ImageComparer(images, group7, group6);
// submit to executor, wait to finish
comparer1 = new ImageComparer(images, group1, group4);
...
comparer4 = new ImageComparer(images, group7, group5);

Split Files in a directory uniformly across threads in JAVA

I have a variable number of files in a directory and different threads in Java to process them. The number of threads depends on the current processor:
int numberOfThreads=Runtime.getRuntime().availableProcessors();
File[] inputFilesArr=currentDirectory.listFiles();
How do I split the files uniformly across threads? If I do simple math like
int filesPerThread=inputFilesArr.length/numberOfThreads
then I might end up missing some files if inputFilesArr.length is not exactly divisible by numberOfThreads. What is an efficient way of doing this so that the partition and the load across all threads are uniform?
Here is another take on this problem:
Use Java's ThreadPoolExecutor.
It works on the thread-pool principle: you do not create threads every time you need one; a specified number of threads is created at the start and reused from the pool.
The idea is to treat the processing of each file in the directory as an independent task, to be performed by some thread.
Submit all the tasks to the executor in a loop; this makes sure that no files are left out.
The executor adds these tasks to a queue and at the same time picks up threads from the pool and assigns them tasks until all the threads are busy.
Remaining tasks wait until a thread becomes available, so configuring the thread-pool size is vital here: you can have as many threads as there are files, or fewer.
I assume here that each file can be processed independently and that there is no requirement for a certain bunch of files to be processed by a single thread.
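A minimal sketch of that approach (processFile is a stand-in for whatever per-file work you do):
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DirectoryProcessor {

    // Placeholder for the real per-file work.
    static void processFile(File f) {
    }

    public static void main(String[] args) throws InterruptedException {
        File[] inputFilesArr = new File(args[0]).listFiles();
        int numberOfThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
        for (File f : inputFilesArr) {
            pool.submit(() -> processFile(f));        // every file becomes one task, so none are skipped
        }
        pool.shutdown();                              // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);     // wait for the queued tasks to finish
    }
}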
You can use a round-robin algorithm for an even distribution. Here is the pseudocode:
ProcessThread t[] = new ProcessThread[Number of Cores];
int i = 0;
foreach(File f in files)
{
t[i++ % t.length].queueForProcessing(f);
}
foreach(Thread tt in t)
{
tt.join();
}
The Producer Consumer pattern will solve this gracefully. Have one producer (the main thread) put all the files on a bounded blocking queue (see BlockingQueue). Then have a number of worker threads take a file from the queue and process it.
The work (rather than the files) will be uniformly distributed over threads, since threads that are done processing one file, come ask for the next file to process. This avoids the possible problem that one thread gets assigned only large files to process, and other threads get only small files to process.
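A rough sketch of that setup (the poison-pill shutdown and all names here are illustrative):
import java.io.File;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class FileWorkQueue {

    private static final File POISON = new File("");  // sentinel telling a worker to stop

    // Placeholder for the real per-file work.
    static void process(File f) {
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<File> queue = new ArrayBlockingQueue<>(100);
        int workers = Runtime.getRuntime().availableProcessors();

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    for (File f = queue.take(); f != POISON; f = queue.take()) {
                        process(f);                    // a free worker grabs the next file, so the load balances itself
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }

        for (File f : new File(args[0]).listFiles()) {
            queue.put(f);                              // producer: blocks if the queue is full
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON);                         // one sentinel per worker
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}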
You can try to compute the range (start and end index in inputFilesArr) of files per thread:
if (inputFilesArr.length < numberOfThreads)
numberOfThreads = inputFilesArr.length;
int[][] filesRangePerThread = getFilesRangePerThread(inputFilesArr.length, numberOfThreads);
and
private static int[][] getFilesRangePerThread(int filesCount, int threadsCount)
{
int[][] filesRangePerThread = new int[threadsCount][2];
if (threadsCount > 1)
{
float odtRangeIncrementFactor = (float) filesCount / threadsCount;
float lastEndIndexSet = odtRangeIncrementFactor - 1;
int rangeStartIndex = 0;
int rangeEndIndex = Math.round(lastEndIndexSet);
filesRangePerThread[0] = new int[] { rangeStartIndex, rangeEndIndex };
for (int processCounter = 1; processCounter < threadsCount; processCounter++)
{
rangeStartIndex = rangeEndIndex + 1;
lastEndIndexSet += odtRangeIncrementFactor;
rangeEndIndex = Math.round(lastEndIndexSet);
filesRangePerThread[processCounter] = new int[] { rangeStartIndex, rangeEndIndex };
}
}
else
{
filesRangePerThread[0] = new int[] { 0, filesCount - 1 };
}
return filesRangePerThread;
}
If you are dealing with I/O, multiple threads can work in parallel even on one processor, because while one thread is waiting on read(byte[]) the processor can run another thread.
Anyway, this is my solution:
int nThreads = 2;
File[] files = new File[9];
int filesPerThread = files.length / nThreads;
class Task extends Thread {
List<File> list = new ArrayList<>();
// implement run here
}
Task task = new Task();
List<Task> tasks = new ArrayList<>();
tasks.add(task);
for (int i = 0; i < files.length; i++) {
if (task.list.size() == filesPerThread && files.length - i >= filesPerThread) {
task = new Task();
tasks.add(task);
}
task.list.add(files[i]);
}
for(Task t : tasks) {
System.out.println(t.list.size());
}
prints 4 5
Note that it will create 3 threads if you have 3 files and 5 processors

High Level Java Optimization

There are many questions and answers and opinions about how to do low level Java optimization, with for, while, and do-while loops, and whether it's even necessary.
My question is more of a High Level based optimization in design. Let's assume I have to do the following:
for a given string input, count the occurrence of each letter in the string.
This is not a major problem when the string is a few sentences, but what if instead we want to count the occurrence of each word in a 900,000-word file? Building loops just wastes time.
So what is the high level design pattern that can be applied to this type of problem.
I guess my major point is that I tend to use loops to solve many problems, and I would like to get out of the habit of using loops.
thanks in advance
Sam
P.S. If possible, can you produce some pseudocode for solving the 900,000-word file problem? I tend to understand code better than English, which I assume is the same for most visitors of this site.
The word count problem is one of the most widely covered problems in the Big Data world; it's kind of the Hello World of frameworks like Hadoop. You can find ample information throughout the web on this problem.
I'll give you some thoughts on it anyway.
First, 900000 words might still be small enough to build a hashmap for, so don't discount the obvious in-memory approach. You said pseudocode is fine, so:
h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
h[w] = w in h ? h[w]++ : 1
}
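If you want the same thing in actual Java rather than pseudocode, a minimal version might look like this ("input.txt" is just a placeholder path):
import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner in = new Scanner(Paths.get("input.txt"))) {
            while (in.hasNext()) {
                // merge() inserts 1 for a new word, otherwise adds 1 to the existing count
                counts.merge(in.next().toLowerCase(), 1, Integer::sum);
            }
        }
        counts.forEach((word, n) -> System.out.println(word + " " + n));
    }
}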
Now once your dataset is too large to build an in-memory hashmap, you can do your counting like so:
Tokenize the input into words, writing each word on its own line in a file
Use the Unix sort command to produce the next file
Count as you traverse the sorted file
These three steps go in a Unix pipeline. Let the OS do the work for you here.
Now, as you get even more data, you want to bring in map-reduce frameworks like hadoop to do the word counting on clusters of machines.
Now, I've heard that when you get into obscenely large datasets, doing things in a distributed environment does not help anymore, because the transmission time overwhelms the counting time, and in your case of word counting everything has to "be put back together anyway", so then you have to use some very sophisticated techniques that I suspect you can find in research papers.
ADDENDUM
The OP asked for an example of tokenizing the input in Java. Here is the easiest way:
import java.util.Scanner;
public class WordGenerator {
    /**
     * Tokenizes standard input into words, writing each word to standard output,
     * one per line. Because it reads from standard input and writes to standard
     * output, it can easily be used in a pipeline combined with sort, uniq, and
     * any other such application.
     */
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    }
}
Now here is an example of using it:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator
This outputs
hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.
You can combine this tokenizer with sort and uniq like so:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq
Yielding
hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo
Now if you only want to keep letters and throw away all punctuation, digits and other characters, change your scanner definition line to:
Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));
And now
echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq
Yields
hey
moe
nyuk
soitenly
why
woo
There is a blank line in the output; I'll let you figure out how to whack it. :)
The fastest solution to this is O(n), AFAIK: use a loop to iterate over the string, get each character and update its count in a HashMap accordingly. At the end the HashMap contains all the characters that occurred, each with a count of its occurrences.
Some pseudo-code (may not compile):
HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++) {
    char c = str.charAt(i);
    if (map.containsKey(c)) map.put(c, map.get(c) + 1);
    else map.put(c, 1);
}
It's hard for you to get much better than using a loop to solve this problem. IMO, the best way to speed up this sort of operation is to split the workload into different units of work and process the units of work with different processors (using threads, for example, if you have a multiprocessor computer).
You shouldn't assume 900,000 is a lot of words. If you have a CPU with 8 threads at 3 GHz, that's 24 billion clock cycles per second. ;)
However, for counting characters an int[] will be much faster; there are only 65,536 possible characters.
StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
words.append(Long.toString(rand.nextLong(), 36)).append(' ');
String text = words.toString();
long start = System.nanoTime();
int[] charCount = new int[Character.MAX_VALUE + 1]; // one counter per possible char value
for (int i = 0; i < text.length(); i++)
charCount[text.charAt(i)]++;
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000/1000, text.length());
prints
Took 111 ms to count 139,715,647 characters
Even 11x the number of words takes a fraction of a second.
A much longer parallel version is a little faster.
public static void main(String... args) throws InterruptedException, ExecutionException {
StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
words.append(Long.toString(rand.nextLong(), 36)).append(' ');
final String text = words.toString();
long start = System.nanoTime();
// start a thread pool to generate 4 tasks to count sections of the text.
final int nThreads = 4;
ExecutorService es = Executors.newFixedThreadPool(nThreads);
List<Future<int[]>> results = new ArrayList<Future<int[]>>();
int blockSize = (text.length() + nThreads - 1) / nThreads;
for (int i = 0; i < nThreads; i++) {
final int min = i * blockSize;
final int max = Math.min(min + blockSize, text.length());
results.add(es.submit(new Callable<int[]>() {
@Override
public int[] call() throws Exception {
int[] charCount = new int[Character.MAX_VALUE + 1];
for (int j = min; j < max; j++)
charCount[text.charAt(j)]++;
return charCount;
}
}));
}
es.shutdown();
// combine the results.
int[] charCount = new int[Character.MAX_VALUE + 1];
for (Future<int[]> resultFuture : results) {
int[] result = resultFuture.get();
for (int i = 0, resultLength = result.length; i < resultLength; i++) {
charCount[i] += result[i];
}
}
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
}
prints
Took 45 ms to count 139,715,537 characters
But for a String with less than a million words it's not likely to be worth it.
As a general rule, you should just write things in a straightforward way, and then do performance tuning to make it as fast as possible.
If that means putting in a faster algorithm, do so, but at first, keep it simple.
For a small program like this, it won't be too hard.
The essential skill in performance tuning is not guessing.
Instead, let the program itself tell you what to fix.
This is my method.
For more involved programs, like this one, experience will show you how to avoid the over-thinking that ends up causing a lot of the poor performance it is trying to avoid.
You have to use a divide-and-conquer approach and avoid racing for resources. There are different approaches and/or implementations for that. The idea is the same - split the work and parallelize the processing.
On a single machine you can process chunks of the data in separate threads, although having the chunks on the same disk will slow things down considerably. Having more threads means more context switching; for throughput it is IMHO better to have a smaller number of them and keep them busy.
You can split the processing into stages and use SEDA or something similar, and with really big data you go for map-reduce - just count on the expense of distributing the data across a cluster.
I'll be glad if somebody points to another widely-used API.

Code inside thread slower than outside thread..?

I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
public void run()
{
//doSomething
}
}
Then I submit the Runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when working in a Thread, but I don't know what they could be, or what the difference with //doSomethingElse is in that respect.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out of context, but I will post it here anyway. For those who know the Mean Shift segmentation algorithm, this is the part of the code where the mean shift vector is being calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.
All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
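As an illustration, a rough sketch of splitting the outer pixel loop into per-range tasks could look like this (processPixel stands in for the body of your outer loop, and each task must use its own yk/Mh scratch arrays so nothing is shared):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OuterLoopSplitter {

    // Placeholder for the body of the original outer loop for one pixel index.
    static void processPixel(int i) {
    }

    static void processAll(int L) throws Exception {
        int nTasks = Runtime.getRuntime().availableProcessors();
        ExecutorService es = Executors.newFixedThreadPool(nTasks);
        List<Future<?>> futures = new ArrayList<>();
        int block = (L + nTasks - 1) / nTasks;
        for (int t = 0; t < nTasks; t++) {
            final int from = t * block;
            final int to = Math.min(from + block, L);
            futures.add(es.submit(() -> {
                for (int i = from; i < to; i++) {
                    processPixel(i);                   // one contiguous block of pixels per task
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get();                                   // wait for every task to finish
        }
        es.shutdown();
    }
}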
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
final int lap = i;
Runnable r = new Runnable() {
public void run() {
long start = System.currentTimeMillis();
//doSomething
long duration = System.currentTimeMillis() - start;
System.out.printf("Lap %d: %d ms%n", lap, duration);
}
};
executor.execute(r);
}
Do you notice any difference in the results?
I personally do not see any reason for this. Any program has at least one thread. All threads are equal. All threads are created by default with medium priority (5). So the code should show the same performance in both the main application thread and any other thread that you open.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.
When you create a new thread you always have an overhead. If you have a small piece of code, you may experience performance loss.
Once you have more code (bigger tasks) you may get a performance improvement from your parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: deciding how small a task can be while parallelizing it is still worthwhile is a well-known topic in parallel computing :)
You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking when measuring performance it's easy to get mislead when measuring small pieces of work. I would be looking to get a run of at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one difference between the "no thread" and "threaded" cases is that you have gone from having one thread (as has been pointed out, you always have a thread) to two threads, so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.
