I have approximately 40000 objects which might need to be repainted.
Most of them are not on the screen, so it seems that I could save a lot of work by doing the checks concurrently. But, my CPU never goes above 15% usage, so it seems that it is still only using one core. Have I implemented the threads correctly? If so, why aren't all my cores being used? And is there a better way which does utilize all my cores?
public void paintComponent(Graphics g)
{
    super.paintComponent(g);
    if (game.movables.size() > 10000)
    {
        final int size = game.drawables.size();
        final Graphics gg = g;
        Thread[] threads = new Thread[8];
        for (int j = 0; j < 8; ++j)
        {
            final int n = j;
            threads[j] = new Thread(new Runnable()
            {
                public void run()
                {
                    Drawable drawMe;
                    int start = (size / 8) * n;
                    int end = (size / 8) * (n + 1);
                    // n runs from 0 to 7, so the last thread is n == 7;
                    // it picks up the remainder in case size % 8 != 0
                    if (n == 7) end = game.drawables.size();
                    for (int i = start; i < end; ++i)
                    {
                        drawMe = game.drawables.get(i);
                        if (drawMe.isOnScreen())
                        {
                            synchronized (gg)
                            {
                                drawMe.draw(gg);
                            }
                        }
                    }
                }
            });
            threads[j].start();
        }
        try
        {
            for (int j = 0; j < 8; ++j)
                threads[j].join();
        }
        catch (InterruptedException e)
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    else
    {
        for (Drawable drawMe : game.drawables)
        {
            if (drawMe.isOnScreen())
            {
                drawMe.draw(g);
            }
        }
    }
}
As has been pointed out, the synchronized (gg) is effectively serializing all the drawing, so you're probably going slower than single-threaded code due to thread creation and other overhead.
The main reason I'm writing however is that Swing, which this presumably is, is not thread safe. So the behavior of this program is not only likely to be bad, it's undefined.
Threading errors like this turn up as screwy behavior on some machines with some java runtime parameters and some graphics drivers. Been there. Done that. Not good.
JOGL will give you direct access to the GPU, the surest way to speed rendering.
To do this right, you might start by putting each drawMe in a (properly synchronized) list, then actually draw them in a loop after the joins are done. You can't speed the drawing (though if you've knocked out 99% of the drawMe's you've cut down the time needed dramatically), but if isOnScreen() is somewhat complicated, you'll get some real work out of your cores.
A ConcurrentLinkedQueue would save you the need to synchronize adds to the list.
The next step might be to use a blocking queue instead of a list, so the paint code could run in parallel with the visibility checks. With eight checks running, they should keep well ahead of the drawing. (But I think all the blocking queues either need synchronizing or do synching themselves. I'd skip this and stick with the CLQ and the first solution. Simpler and possibly faster.)
And (as Gene pointed out), everything Swing related starts on the EventQueue. Keep it there or life will get strange. Only your own code, not referencing the UI, should run in your threads.
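The first suggestion above (check visibility in parallel, then draw sequentially) could be sketched roughly as follows. This is a minimal sketch, not the asker's code: the Drawable interface here is a hypothetical stand-in with only isOnScreen(), the class and method names are invented for illustration, and the workers never touch a Graphics object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class VisibilityCull {
    /** Hypothetical stand-in for the question's Drawable type. */
    interface Drawable {
        boolean isOnScreen();
    }

    /** Runs isOnScreen() on several threads and returns only the visible objects. */
    static Queue<Drawable> collectVisible(List<Drawable> drawables, int threadCount)
            throws InterruptedException {
        Queue<Drawable> visible = new ConcurrentLinkedQueue<>();
        int size = drawables.size();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            // Proportional slicing: the last slice always ends exactly at size.
            int start = (int) ((long) size * t / threadCount);
            int end = (int) ((long) size * (t + 1) / threadCount);
            threads[t] = new Thread(() -> {
                for (int i = start; i < end; i++) {
                    Drawable d = drawables.get(i);
                    if (d.isOnScreen()) {
                        visible.add(d); // thread-safe add, no Graphics access here
                    }
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return visible;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Drawable> drawables = new ArrayList<>();
        for (int i = 0; i < 40_000; i++) {
            boolean onScreen = i % 100 == 0; // pretend 1% of the objects are visible
            drawables.add(() -> onScreen);
        }
        Queue<Drawable> visible = collectVisible(drawables, 8);
        // paintComponent would now loop over `visible` and call draw(g),
        // on the event dispatch thread only.
        System.out.println(visible.size());
    }
}
```

Creating eight threads on every repaint has its own cost, so a long-lived thread pool may be a better fit in practice; the sketch keeps plain threads only to stay close to the question's code.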
Since you're already not drawing any objects that are off-screen, you're probably gaining very very little by doing what you're doing above.
I would even go as far as to say you're making it worse, by introducing synchronized, which is slow, and by introducing threads that cause context switches, which are expensive.
To improve performance you should perhaps look into using different drawing libraries, such as the Java2D drawing library, which is part of the JDK: http://java.sun.com/products/java-media/2D/index.jsp
I'm not sure how java will handle this, but other languages will blow up horribly and die if you reference something across scopes like you're doing with final int n (since it goes out of scope when the loop stops). Consider making it a field of the runnable object. Also, you're synchronizing on the graphics object while you're doing all of the real work. It's likely that you aren't getting any real performance increase from this. You might benefit from explicitly checking if the object is on the screen in parallel which is a read only operation, adding on-screen objects to a set or collection of some other sort, and then rendering sequentially.
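On the scoping worry: in Java, a local variable captured by an anonymous class (or lambda) is copied into the object when it is created, which is why it has to be final or effectively final. A small sketch, with invented class and method names, showing that each Runnable keeps its own value of n after the loop has ended:

```java
import java.util.ArrayList;
import java.util.List;

final class CaptureDemo {
    /** Builds one Runnable per loop iteration; each records the n it captured. */
    static List<Integer> runCaptured(int count) {
        List<Integer> seen = new ArrayList<>();
        List<Runnable> tasks = new ArrayList<>();
        for (int j = 0; j < count; j++) {
            final int n = j; // copied into the anonymous class when it is created
            tasks.add(new Runnable() {
                @Override
                public void run() {
                    seen.add(n);
                }
            });
        }
        // The loop variables are out of scope here, yet each task still has its copy.
        for (Runnable task : tasks) {
            task.run();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(runCaptured(8)); // prints [0, 1, 2, 3, 4, 5, 6, 7]
    }
}
```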
I am developing a small game, (Java, LibGdx) where the player fills cloze-style functions with predefined lines of code. The game would then compile the code and run a small test suite to verify that the function does the stuff it is supposed to.
Compiling and running the code already works, but I am faced with the problem of detecting infinite loops. Consider the following function:
// should compute the sum of [1 .. n]
public int foo(int n) {
    int i = 0;
    while (n > 0) {
        i += n;
        // this is the place where the player inserts one of many predefined lines of code
        // the right one would be: n--;
        // but the player could also insert something silly like: i++;
    }
    return i;
}
Please note that the functions actually used may be more complex and in general it is not possible to make sure that there cannot be any infinite loops.
Currently I am running the small test suite (provided for every function) in a Thread using an ExecutorService, setting a timeout to abort waiting in case the thread is stuck. The problem with this is, that the threads stuck in an endless loop will run forever in the background, which of course will at some point have a considerable impact on game performance.
// TestClass is the compiled class containing the function above and the corresponding test suite
Callable<Boolean> task = new Callable<Boolean>() {
    @Override
    public Boolean call() throws Exception {
        // call the test suite
        return new TestClass().test();
    }
};
Future<Boolean> future = executorService.submit(task);
try {
    Boolean result = future.get(100, TimeUnit.MILLISECONDS);
    System.out.println("result: " + (result == null ? "null" : result.toString()));
} catch (InterruptedException e) {
    e.printStackTrace();
} catch (ExecutionException e) {
    e.printStackTrace();
} catch (TimeoutException e) {
    e.printStackTrace();
    future.cancel(true);
}
My question is now: How can I gracefully end the threads that accidentally spin inside an endless loop?
EDIT: To clarify why, in this case, preventing infinite loops is not possible/feasible: The functions, their test suites and the lines to fill the gaps are loaded from disk. There will be hundreds of functions with at least two lines of code that could be inserted. The player can drag any line into any gap. The effort needed to make sure no combination of function gap/code line produces something that loops infinitely or even runs longer than the timeout grows exponentially with the number of functions. This quickly gets to the point where nobody has the time to check all of these combinations manually. Also, in general, determining whether a function will finish in time is pretty much impossible because of the halting problem.
There is no such thing as "graceful termination" of a thread inside the same process. The terminated thread can leave inconsistent shared-memory state behind it.
You can either organize things so that each task is started in its own JVM, or make do with forceful termination using the deprecated Thread.stop() method (which throws UnsupportedOperationException on JDK 20 and later).
Another option is inserting a check into the generated code, but this would require much more effort to implement properly.
The right way is to change the design so that it avoids never-ending loops.
For the time being, inside your loop you could check whether the thread has been interrupted by calling Thread.currentThread().isInterrupted().
And if it has, you just exit.
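A minimal sketch of that interrupt check, assuming the check can be injected into the loop body of the generated code. The class name, the latch and the loop itself are illustrative only; the point is that future.cancel(true) interrupts the worker, and the loop exits once it sees the flag.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class InterruptibleLoop {
    /** Returns true if the runaway task noticed the interrupt and exited. */
    static boolean runAndCancel() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch exited = new CountDownLatch(1);
        Callable<Integer> task = () -> {
            try {
                int i = 0;
                int n = 5;
                while (n > 0) { // never ends on its own: n is never decremented
                    if (Thread.currentThread().isInterrupted()) {
                        return -1; // injected check: give up once cancelled
                    }
                    i++;
                }
                return i;
            } finally {
                exited.countDown(); // signals that the worker left the loop
            }
        };
        Future<Integer> future = executor.submit(task);
        try {
            future.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // sets the worker thread's interrupt flag
        }
        boolean stopped = exited.await(2, TimeUnit.SECONDS);
        executor.shutdown();
        return stopped;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAndCancel());
    }
}
```

Without the injected check, cancel(true) would have no effect on this loop, because nothing in it responds to interruption.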
It is not normal to have a never-ending loop if it is not wanted.
To solve the problem you can add a counter in the loop and exit once it reaches a limit.
int counter = 0;
while (n > 0) {
    counter++;
    if (counter > THRESHOLD) {
        break;
    }
    i += n;
    // this is the place where the player inserts one of many predefined lines of code
    // the right one would be: n--;
    // but the player could also insert something silly like: i++;
}
I recently embarked on a project to simulate a collection of stellar bodies with the use of LWJGL. The solution required many loop iterations per frame to accomplish. The program calculates the forces exerted on each body by every other body. I did not wish to implement any form of limitations, such as tree algorithms. The program itself is able to simulate 800 bodies of random mass (between 1 and 50) at around 15 fps. Here is the original code for calculating, then updating the position of each body.
public void updateAllBodies() {
    for (Body b : bodies) {
        for (Body c : bodies) {
            if (b != c) {
                double[] force = b.getForceFromBody(c, G);
                b.velocity[0] += force[0];
                b.velocity[1] += force[1];
                b.velocity[2] += force[2];
                b.updatePosition();
            }
        }
    }
}
Recently I came across the subject of parallels and streams. Seeing that my original code used only one thread, I thought I might be able to improve the performance by converting the array to a stream, and executing it with the use of
.parallelStream()
I don't know much about multi-threading and parallelism, but here is the resulting code that I came up with.
public void updateAllBodies() {
    Arrays.asList(bodies).parallelStream().forEach(i -> {
        for (Body b : bodies) {
            if (i != b) {
                double[] force = i.getForceFromBody(b, G);
                i.velocity[0] += force[0];
                i.velocity[1] += force[1];
                i.velocity[2] += force[2];
                i.updatePosition();
            }
        }
    });
}
Unfortunately, when executed, this new code resulted in the same 15 fps as the old one. I was able to confirm that there were 3 concurrent threads running with
Thread.currentThread().getName();
At this point, I have no idea as to what the cause could be. Lowering the number of bodies does show a drastic increase in frame rate. Any help will be greatly appreciated.
I can't seem to find a way to mark a comment as the answer to a post, so I will state that the best answer was given by softwarenwebie7331.
I have been struggling for two days to understand what is going on with C++ thread pool performance compared to a single thread. Then I decided to do the same in Java, and that is when I noticed that the behaviour is the same in C++ and Java. My code is simple and straightforward.
package com.examples.threading;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class ThreadPool {
    final static AtomicLong lookups = new AtomicLong(0);
    final static AtomicLong totalTime = new AtomicLong(0);

    public static class Task implements Runnable
    {
        int start = 0;

        Task(int s) {
            start = s;
        }

        @Override
        public void run()
        {
            for (int j = start; j < start + 3000; j++) {
                long st = System.nanoTime();
                boolean a = false;
                long et = System.nanoTime();
                totalTime.getAndAdd((et - st));
                lookups.getAndAdd(1L);
            }
        }
    }

    public static void main(String[] args)
    {
        // change threads from 1 -> 100 then you will get different numbers
        ExecutorService executor = Executors.newFixedThreadPool(1);
        for (int i = 0; i <= 1000000; i++)
        {
            if (i % 3000 == 0) {
                Task task = new Task(i);
                executor.execute(task);
                System.out.println("in time " + (totalTime.doubleValue()/lookups.doubleValue()) + " lookups: " + lookups.toString());
            }
        }
        executor.shutdown();
        while (!executor.isTerminated()) {
            ;
        }
        System.out.println("in time " + (totalTime.doubleValue()/lookups.doubleValue()) + " lookups: " + lookups.toString());
    }
}
Now, when you run the same code with a different pool size, say 100 threads, the overall elapsed time changes.
one thread:
in time 36.91493612774451 lookups: 1002000
100 threads:
in time 141.47934530938124 lookups: 1002000
The question is: the code is the same, so why is the overall elapsed time different? What exactly is going on here?
You have a couple of obvious possibilities here.
One is that System.nanoTime may serialize internally, so even though each thread is making its call separately, it may internally execute those calls in sequence (and, for example, queue up calls as they come in). This is particularly likely when nanoTime directly accesses a hardware clock, such as on Windows (where it uses Windows' QueryPerformanceCounter).
Another point at which you get essentially sequential execution is your atomic variables. Even though you're using lock-free atomics, the basic fact is that each has to execute a read/modify/write as an atomic sequence. With locked variables, that's done by locking, then reading, modifying, writing, and unlocking. With lock-free, you eliminate some of the overhead in doing that, but you're still stuck with the fact that only one thread can successfully read, modify, and write a particular memory location at a given time.
In this case the only "work" each thread is doing is trivial, and the result is never used, so the optimizer can (and probably will) eliminate it entirely. So all you're really measuring is the time to read the clock and increment your variables.
To gain at least some of the speed back, you could (for one example) give each thread its own lookups and totalTime variable. Then when all the threads finish, you can add together the values for the individual threads to get an overall total for each.
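One way to sketch the per-thread counters, assuming each task may return its own totals through a Future instead of updating shared AtomicLongs. The class name, method name and parameters are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class LocalCounters {
    /** Runs the tasks and returns {total lookups, total nanoseconds}. */
    static long[] run(int threads, int tasks, int iterationsPerTask) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        List<Future<long[]>> results = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            Callable<long[]> task = () -> {
                long localLookups = 0; // private to this task, so no contention
                long localNanos = 0;
                for (int j = 0; j < iterationsPerTask; j++) {
                    long st = System.nanoTime();
                    long et = System.nanoTime();
                    localNanos += et - st;
                    localLookups++;
                }
                return new long[] {localLookups, localNanos};
            };
            results.add(executor.submit(task));
        }
        long lookups = 0;
        long totalNanos = 0;
        for (Future<long[]> f : results) {
            long[] r = f.get(); // waits for the task, then merges its totals
            lookups += r[0];
            totalNanos += r[1];
        }
        executor.shutdown();
        return new long[] {lookups, totalNanos};
    }

    public static void main(String[] args) throws Exception {
        long[] totals = run(100, 334, 3000);
        System.out.println("in time " + ((double) totals[1] / totals[0])
                + " lookups: " + totals[0]);
    }
}
```

This removes the contention on the shared counters only; any cost inside System.nanoTime itself is unchanged.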
Preventing serialization of the timing is a little more difficult (to put it mildly). At least in the obvious design, each call to nanoTime directly accesses a hardware register, which (at least with most typical hardware) can only happen sequentially. It could be fixed at the hardware level (provide a high-frequency timer register that's directly readable per-core, guaranteed to be synced between cores). That's a somewhat non-trivial task, and (more importantly) most current hardware just doesn't include such a thing.
Other than that, do some meaningful work in each thread, so when you execute in multiple threads, you have something that can actually use the resources of your multiple CPUs/cores to run faster.
OK, so I did my research. There are plenty of questions here on thread synchronization, but none of them really hit the point. I am currently working in OpenCV. I get a frame from the camera containing vehicles, remove the background and track the vehicles, but before I do this I do some pre-processing and post-processing, like removing noise with blur. All this runs in a single thread and it works great, but here comes an issue: I now want to read number plates. For this I need a higher-resolution frame, otherwise I will not detect a single plate in any frame, but as soon as I increase my frame size I get a performance hit. My thread slows down to the point that my program no longer qualifies as a real-time system.
So I thought of adding more threads to my scene, each specializing in one task.
Here is a list of my tasks:
// receives a frame from the camera
1. preprocess
// receives a frame from preprocess and removes the background
2. remove background
// receives a frame from the background remover and tracks the vehicles
3. postprocess
If I run the threads one by one, I am thinking it will still be slow. Instead I thought of running the threads simultaneously, but the issue is that they use the same objects. Declaring them volatile will mean threads waiting for the thread with the lock to complete before they can use the object, which will mean a slow system again. So my question is: how can I run these threads simultaneously without having to wait for others?
I have looked at a dozen multithreading techniques in Java but am finding it really hard to come up with a way of making this work.
So far I have looked at
1. Thread synchronization using the keyword volatile
2. Thread synchronization using the keyword synchronized
3. Multiple thread locks using a lock object
4. Using threadpools
5. Using the Countdown Latch
6. Wait and notify
7. Using Semaphores(which seemed like a good idea).
Here is the code I want to break down into those threads
public void VideoProcessor()
{
videProcessorThread = new Thread(new Runnable()
{
@Override
public void run()
{
try{
int i = 0;
while (isPlaying() && isMainScreenONOFF()) {
camera.read(frame);
//set default and max frame speed
camera.set(Videoio.CAP_PROP_FPS, 25);
//get frame speed just incase it did not set
fps = camera.get(Videoio.CAP_PROP_FPS);
//if(frame.height() > imgHeight || frame.width() > imgWidth)
Imgproc.resize(frame, frame, frameSize);
//check if to convert or not
if(getblackAndWhite())
Imgproc.cvtColor(frame, frame, Imgproc.COLOR_RGB2GRAY);
imag = frame.clone();
if(rOI){
//incase user adjusted the lines we try calculate there new sizes
adjustLinesPositionAndSize(xAxisSlider.getValue(), yAxisSlider.getValue());
//then we continue and draw the lines
if(!roadIdentified)
roadTypeIdentifier(getPointA1(), getPointA2());
}
viewClass.updateCarCounter(tracker.getCountAB(), tracker.getCountBA());
if (i == 0) {
// jFrame.setSize(FRAME_WIDTH, FRAME_HEIGHT);
diffFrame = new Mat(outbox.size(), CvType.CV_8UC1);
diffFrame = outbox.clone();
}
if (i == 1) {
diffFrame = new Mat(frame.size(), CvType.CV_8UC1);
removeBackground(frame, diffFrame, mBGSub, thresHold.getValue(), learningRate.getValue());
frame = diffFrame.clone();
array = detectionContours(diffFrame, maximumBlob.getValue(), minimumBlob.getValue());
Vector<VehicleTrack> detections = new Vector<>();
Iterator<Rect> it = array.iterator();
while (it.hasNext()) {
Rect obj = it.next();
int ObjectCenterX = (int) ((obj.tl().x + obj.br().x) / 2);
int ObjectCenterY = (int) ((obj.tl().y + obj.br().y) / 2);
//try counter
//add centroid and bounding rectangle
Point pt = new Point(ObjectCenterX, ObjectCenterY);
VehicleTrack track = new VehicleTrack(frame, pt, obj);
detections.add(track);
}
if (array.size() > 0) {
tracker.update(array, detections, imag);
Iterator<Rect> it3 = array.iterator();
while (it3.hasNext()) {
Rect obj = it3.next();
int ObjectCenterX = (int) ((obj.tl().x + obj.br().x) / 2);
int ObjectCenterY = (int) ((obj.tl().y + obj.br().y) / 2);
Point pt = null;
pt = new Point(ObjectCenterX, ObjectCenterY);
Imgproc.rectangle(imag, obj.br(), obj.tl(), new Scalar(0, 255, 0), 2);
Imgproc.circle(imag, pt, 1, new Scalar(0, 0, 255), 2);
//count and eleminate counted
tracker.removeCounted(tracker.tracks);
}
} else if (array.size() == 0) {
tracker.updateKalman(imag, detections);
}
}
i = 1;
//Convert Image and display to View
displayVideo();
}
//if error occur or video finishes
Image image = new Image("/assets/eyeMain.png");
viewClass.updateMainImageView(image);
}catch(Exception e)
{
e.printStackTrace();
System.out.println("Video Stopped Unexpectedly");
}
//thread is done
}
});
videProcessorThread.start();
}
As no-one else has replied, I'll have a go.
You've already covered the main technical aspects in your questions (locking, synchronisation etc). Whichever way you look at it, there is no general solution to designing a multi-threaded system. If you have threads accessing the same objects you need to design your synchronisation and you can get threads blocking each other, slowing everything down.
The first thing to do is to do some performance profiling as there is no point making things run in parallel if they are not slowing things down.
That said, I think there are three approaches you could take in your case.
1. Have a single thread process each frame but have a pool of threads processing frames in parallel. If it takes a second to process a frame and you have 25fps you'd need at least 25 threads to keep up with the frame rate. You'd always be about a second behind real time but you should be able to keep up with the frame rate.
A typical way to implement this would be to put the incoming frames in a queue. You then have a pool of threads reading the latest frame from the queue and processing it. The downside of this design is that you can't guarantee in which order you would get the results of the processing, so you might need to add some more logic around sorting the results.
The advantages are that:
There is very little contention, just around getting the frames off the queue, and that should be minimal.
It is easy to tune and scale by adjusting the number of threads. It could even run on multiple machines, depending on how easy it is to move frames between machines.
You avoid the overhead of creating a new thread each time, as each thread processes one frame after another.
It is easy to monitor, as you can just look at the size of the queue.
Error handling can be implemented easily, e.g. use ActiveMQ to re-queue a frame if a thread crashes.
2. Run parts of your algorithm in parallel. The way you've written it (pre-process, process, post-process), I don't see that this is suitable, as you can't do the post-processing at the same time as the pre-processing. However, if you can express your algorithm in steps that can be run in parallel then it might work.
3. Try and run specific parts of your code in parallel. Looking at the code you posted, the iterators are the obvious choice. Is there any reason not to run the iterator loops in parallel? If you can, experiment with the Java parallel streams to see if that brings any performance gains.
Personally I'd try option 1 first as it's quick and simple.
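A rough sketch of the first option, assuming each frame can be processed independently of the others (which may not hold for stateful steps such as background subtraction or tracking). processFrame and the String results are hypothetical placeholders, not OpenCV calls. The executor's internal work queue plays the role of the frame queue, and keeping the futures in submission order is one simple way to get the results back in frame order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FramePool {
    /** Hypothetical stand-in for the whole per-frame pipeline. */
    static String processFrame(int frameIndex) {
        return "frame-" + frameIndex + "-done";
    }

    /** Processes frames on a fixed pool and returns the results in frame order. */
    static List<String> processAll(int frameCount, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<String>> pending = new ArrayList<>();
        for (int i = 0; i < frameCount; i++) {
            final int frameIndex = i;
            Callable<String> job = () -> processFrame(frameIndex);
            pending.add(pool.submit(job)); // queued until a worker is free
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : pending) {
            results.add(f.get()); // submission order, regardless of finish order
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processAll(5, 3));
    }
}
```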
I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
    public void run()
    {
        //doSomething
    }
};
Then I submit the runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when working in a Thread, but I don't know what it could be or what, in that aspect is the difference with //doSomethingElse.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out-of-context, but I will post it here anyway. For those who know the Mean Shift segmentation algorithm, this is a part of the code where the mean shift vector is being calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.
All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
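As a sketch of what splitting the outer loop into medium-sized tasks could look like: the index range is cut into chunks and each chunk becomes one task. This assumes the iterations are independent of each other, which the posted loop may not satisfy as written (it skips points based on modeTable). workFor is a hypothetical placeholder for one outer-loop iteration, and the chunk size is a tuning knob, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkedLoop {
    /** Hypothetical per-index work standing in for one outer-loop iteration. */
    static long workFor(int i) {
        long acc = 0;
        for (int j = 0; j < 27; j++) {
            acc += (long) i * j;
        }
        return acc;
    }

    /** Splits [0, length) into chunks, runs one task per chunk, sums the results. */
    static long sumInChunks(int length, int chunkSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> parts = new ArrayList<>();
        for (int start = 0; start < length; start += chunkSize) {
            final int from = start;
            final int to = Math.min(start + chunkSize, length);
            Callable<Long> chunk = () -> {
                long partial = 0; // task-local accumulator, nothing shared
                for (int i = from; i < to; i++) {
                    partial += workFor(i);
                }
                return partial;
            };
            parts.add(pool.submit(chunk));
        }
        long total = 0;
        for (Future<Long> part : parts) {
            total += part.get();
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumInChunks(100_000, 4_096, 4));
    }
}
```

Scratch buffers such as yk and Mh would need to be allocated per task in a real version, since sharing them between threads would be a data race.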
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
    final int lap = i;
    Runnable r = new Runnable() {
        public void run() {
            long start = System.currentTimeMillis();
            //doSomething
            long duration = System.currentTimeMillis() - start;
            System.out.printf("Lap %d: %d ms%n", lap, duration);
        }
    };
    executor.execute(r);
}
Do you notice any difference in the results?
I personally do not see any reason for this. Any program has at least one thread. All threads are equal. All threads are created by default with medium priority (5). So, the code should show the same performance in both the main application thread and other thread that you open.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.
When you create a new thread you always have an overhead. If you have a small piece of code, you may experience performance loss.
Once you have more code (bigger tasks) you may get a performance improvement from parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: deciding how small a task can be while still being worth parallelizing is a well-known topic in parallel computing :)
You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking, when measuring performance it's easy to get misled when measuring small pieces of work. I would be looking to get a run at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one difference between the "No Thread" and "Threaded" cases is actually that you have gone from having one thread (as has been pointed out, you always have a thread) to having two threads, so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.