I am trying to write a Java class to both send and read messages from a JMS queue using multiple threads to speed things up. I have the below code.
System.out.println("Sending messages");
long startTime = System.nanoTime();
Thread threads[] = new Thread[NumberOfThreads];
for (int i = 0; i < threads.length; i ++) {
threads[i] = new Thread() {
public void run() {
try {
for (int i = 0; i < NumberOfMessagesPerThread; i ++) {
sendMessage("Hello");
}
} catch (Exception e) {
e.printStackTrace();
}
}
};
threads[i].start();
}
//Block until all threads are done so we can get total time
for (Thread thread : threads) {
thread.join();
}
long endTime = System.nanoTime();
long duration = (endTime - startTime) / 1000000;
System.out.println("Done in " + duration + " ms");
This code works and sends as many messages to my JMS queue as I specify (via NumberOfThreads and NumberOfMessagesPerThread). However, I am not convinced it is truly running multithreaded. For example, if I set threads to 10 and messages to 100 (so 1000 total messages), it takes the same time as 100 threads and 10 messages each. Even this code below takes the same time.
for (int i = 0; i < 1000; i ++) {
sendMessage("Hello");
}
Am I doing the threading right? I would expect the multithreaded code to be much faster than just a plain for loop.
Are you sharing a single connection (a single Producer) across all threads? If so, you are probably hitting thread contention there, and you are limited to the speed of the socket connection between your producer and your broker. Of course, it will depend a lot on the JMS implementation you are using (and on whether you are using asyncSends or not).
I recommend repeating your tests using completely separate producers (although you will lose the "queue" semantics in terms of message ordering, but I guess that is expected).
Also, I do not recommend running performance tests with numbers as high as 100 threads. Remember that your multithreading capability is at some point limited by the number of cores your machine has (more or less; you also have a lot of I/O here, so it might help to have a few more threads than cores, but 100 is not really a good number in my opinion).
I would also review some of the comments in this post: "Single vs Multi-threaded JMS Producer".
What is the implementation of 'sendMessage'. How are the connections, session, and producers being reused?
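Since the implementation of sendMessage is not shown, the effect described above can only be illustrated with a stand-in. The sketch below uses a hypothetical FakeProducer (not a JMS class) whose synchronized send sleeps for about 1 ms to imitate a round trip to the broker. It compares all threads sharing one producer against one producer per thread. The class and method names are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ProducerContentionSketch {

    /** Hypothetical stand-in for a JMS producer: one send at a time, about 1 ms each. */
    static class FakeProducer {
        synchronized void send(String text) throws InterruptedException {
            Thread.sleep(1); // simulated round trip to the broker
        }
    }

    /** Sends threads * perThread messages and returns the elapsed time in milliseconds. */
    static long run(int threads, int perThread, boolean shareProducer) {
        FakeProducer shared = new FakeProducer();
        List<Thread> workers = new ArrayList<>();
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            // Either every thread contends for one producer, or each gets its own.
            FakeProducer producer = shareProducer ? shared : new FakeProducer();
            Thread worker = new Thread(() -> {
                try {
                    for (int i = 0; i < perThread; i++) {
                        producer.send("Hello");
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // block until all sends are done
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("Shared producer:     " + run(4, 50, true) + " ms");
        System.out.println("Producer per thread: " + run(4, 50, false) + " ms");
    }
}
```

With the shared producer the sends are serialized by the lock, so adding threads does not help, which matches the symptom in the question. Whether a real JMS producer behaves this way depends on the provider.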
If in real time the CPU performs only one task at a time then how is multithreading different from asynchronous programming (in terms of efficiency) in a single processor system?
Let's say, for example, that we have to count from 1 to IntegerMax. In the following program, on my multicore machine, the two-thread run takes almost half the time of the single-thread run. What if we ran this on a single-core machine? And is there any way we could achieve the same result there?
class Demonstration {
public static void main( String args[] ) throws InterruptedException {
SumUpExample.runTest();
}
}
class SumUpExample {
long startRange;
long endRange;
long counter = 0;
static long MAX_NUM = Integer.MAX_VALUE;
public SumUpExample(long startRange, long endRange) {
this.startRange = startRange;
this.endRange = endRange;
}
public void add() {
for (long i = startRange; i <= endRange; i++) {
counter += i;
}
}
static public void twoThreads() throws InterruptedException {
long start = System.currentTimeMillis();
SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);
Thread t1 = new Thread(() -> {
s1.add();
});
Thread t2 = new Thread(() -> {
s2.add();
});
t1.start();
t2.start();
t1.join();
t2.join();
long finalCount = s1.counter + s2.counter;
long end = System.currentTimeMillis();
System.out.println("Two threads final count = " + finalCount + " took " + (end - start));
}
static public void oneThread() {
long start = System.currentTimeMillis();
SumUpExample s = new SumUpExample(1, MAX_NUM );
s.add();
long end = System.currentTimeMillis();
System.out.println("Single thread final count = " + s.counter + " took " + (end - start));
}
public static void runTest() throws InterruptedException {
oneThread();
twoThreads();
}
}
Output:
Single thread final count = 2305843008139952128 took 1003
Two threads final count = 2305843008139952128 took 540
For a purely CPU-bound operation you are correct. Most (99.9999%) of programs need to do input, output, and invoke other services. Those are orders of magnitude slower than the CPU, so while waiting for the results of an external operation, the OS can schedule and run other (many other) processes in time slices.
Hardware multithreading benefits primarily when 2 conditions are met:
CPU-intensive operations;
That can be efficiently divided into independent subsets
Or you have lots of different tasks to run that can be efficiently divided among multiple hardware processors.
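As a small illustration of a CPU-bound task that divides into independent subsets, the sketch below sums a range once on the calling thread and once with a parallel stream. Parallel streams are a substitute technique here, not what the question used. The class and method names are assumptions, and the timings printed by main are only indicative, not a proper benchmark.

```java
import java.util.stream.LongStream;

public class ParallelSumSketch {

    /** Sums 1..n on the calling thread. */
    static long sequentialSum(long n) {
        return LongStream.rangeClosed(1, n).sum();
    }

    /** Sums 1..n with the range split across the common fork/join pool. */
    static long parallelSum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        long n = 100_000_000L;
        long start = System.nanoTime();
        long a = sequentialSum(n);
        long mid = System.nanoTime();
        long b = parallelSum(n);
        long end = System.nanoTime();
        System.out.println("sequential: " + a + " in " + (mid - start) / 1_000_000 + " ms");
        System.out.println("parallel:   " + b + " in " + (end - mid) / 1_000_000 + " ms");
    }
}
```

Both calls return the same value because each sub-range can be summed without reference to the others. That independence is what makes the work divisible.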
In the following program, on my multicore machine, the two-thread run takes almost half the time of the single-thread run.
That is what I would expect from a valid benchmark when the application is using two cores.
However, looking at your code, I am somewhat surprised that you are getting those results ... so reliably.
Your benchmark doesn't take account of JVM warmup effects, particularly JIT compilation.
Your benchmark's add method could potentially be optimized by the JIT compiler to get rid of the loop entirely. (But at least the counts are "used" ... by printing them out.)
I guess you got lucky ... but I'm not convinced those results will be reproducible for all versions of Java, or if you tweaked the benchmark.
Please read this:
How do I write a correct micro-benchmark in Java?
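A minimal sketch of the warm-up idea, assuming a hand-rolled harness rather than a dedicated tool such as JMH (which is the more reliable choice for real measurements). The names sumRange and timeOnce are hypothetical. The workload is run several times with the timings discarded, so that JIT compilation has a chance to happen before anything is measured, and each result is checked so the loop cannot be treated as dead code.

```java
public class WarmupSketch {

    /** Sums the integers in [from, to]; the result is returned so it is actually used. */
    static long sumRange(long from, long to) {
        long total = 0;
        for (long i = from; i <= to; i++) {
            total += i;
        }
        return total;
    }

    /** Runs the workload once and returns the elapsed time in milliseconds. */
    static long timeOnce(long max) {
        long start = System.nanoTime();
        long result = sumRange(1, max);
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (result != max * (max + 1) / 2) { // consume the result so the loop stays live
            throw new IllegalStateException("unexpected sum " + result);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long max = Integer.MAX_VALUE;
        for (int i = 0; i < 5; i++) {
            timeOnce(max); // warm-up runs, timings discarded
        }
        for (int i = 0; i < 5; i++) {
            System.out.println("measured run " + i + ": " + timeOnce(max) + " ms");
        }
    }
}
```

Even with warm-up, a harness like this can still be misled by the optimizer, so treat the numbers as a rough guide only.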
What if we ran this in a single core machine?
Assuming the following:
You rewrote the benchmark to correct the flaws above.
You are running on a system where hardware hyper-threading (1) is disabled (2).
Then ... I would expect the two-thread version to take more than twice as long as it did on two cores, that is, slightly longer than the one-thread version.
Q: Why "more than"?
A: Because there is a significant overhead in starting a new thread. Depending on your hardware, OS and Java version, it could be more than a millisecond. Certainly, the time taken is significant if you repeatedly use and discard threads.
And is there any way we could achieve the same result there?
Not sure what you are asking here. But if you are asking how to simulate the behavior of one core on a multi-core machine, you would probably need to do this at the OS level. See https://superuser.com/questions/309617 for Windows and https://askubuntu.com/questions/483824 for Linux.
1 - Hyperthreading is a hardware optimization where a single core's processing hardware supports (typically) two hyper-threads. Each hyperthread has its own sets of registers, but it shares functional units such as the ALU with the other hyperthread. So the two hyperthreads behave like (typically) two cores, except that they may be slower, depending on the precise instruction mix. A typical OS will treat a hyperthread as if it is a regular core. Hyperthreading is typically enabled / disabled at boot time; e.g. via a BIOS setting.
2 - If hyperthreading is enabled, it is possible that two Java threads won't be twice as fast as one in a CPU-intensive computation like this ... due to possible slowdown caused by the "other" hyperthread on respective cores. Did someone mention that benchmarking is complicated?
I struggled for 2 days to understand C++ thread pool performance compared to a single thread, then I decided to try the same thing in Java, and that is when I noticed that the behaviour is the same in C++ and Java. My code is simple and straightforward.
package com.examples.threading;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
public class ThreadPool {
final static AtomicLong lookups = new AtomicLong(0);
final static AtomicLong totalTime = new AtomicLong(0);
public static class Task implements Runnable
{
int start = 0;
Task(int s) {
start = s;
}
@Override
public void run()
{
for (int j = start ; j < start + 3000; j++ ) {
long st = System.nanoTime();
boolean a = false;
long et = System.nanoTime();
totalTime.getAndAdd((et - st));
lookups.getAndAdd(1l);
}
}
}
public static void main(String[] args)
{
// change threads from 1 -> 100 then you will get different numbers
ExecutorService executor = Executors.newFixedThreadPool(1);
for (int i = 0; i <= 1000000; i++)
{
if (i % 3000 == 0) {
Task task = new Task(i);
executor.execute(task);
System.out.println("in time " + (totalTime.doubleValue()/lookups.doubleValue()) + " lookups: " + lookups.toString());
}
}
executor.shutdown();
while (!executor.isTerminated()) {
;
}
System.out.println("in time " + (totalTime.doubleValue()/lookups.doubleValue()) + " lookups: " + lookups.toString());
}
}
Now, when you run the same code with a different pool size, say 100 threads, the reported time changes.
one thread:
in time 36.91493612774451 lookups: 1002000
100 threads:
in time 141.47934530938124 lookups: 1002000
The question is: the code is the same, so why is the reported time different? What exactly is going on here?
You have a couple of obvious possibilities here.
One is that System.nanoTime may serialize internally, so even though each thread is making its call separately, it may internally execute those calls in sequence (and, for example, queue up calls as they come in). This is particularly likely when nanoTime directly accesses a hardware clock, such as on Windows (where it uses Windows' QueryPerformanceCounter).
Another point at which you get essentially sequential execution is your atomic variables. Even though you're using lock-free atomics, the basic fact is that each has to execute a read/modify/write as an atomic sequence. With locked variables, that's done by locking, then reading, modifying, writing, and unlocking. With lock-free, you eliminate some of the overhead in doing that, but you're still stuck with the fact that only one thread can successfully read, modify, and write a particular memory location at a given time.
In this case the only "work" each thread is doing is trivial, and the result is never used, so the optimizer can (and probably will) eliminate it entirely. So all you're really measuring is the time to read the clock and increment your variables.
To gain at least some of the speed back, you could (for one example) give each thread its own lookups and totalTime variable. Then when all the threads finish, you can add together the values for the individual threads to get an overall total for each.
Preventing serialization of the timing is a little more difficult (to put it mildly). At least in the obvious design, each call to nanoTime directly accesses a hardware register, which (at least with most typical hardware) can only happen sequentially. It could be fixed at the hardware level (provide a high-frequency timer register that's directly readable per-core, guaranteed to be synced between cores). That's a somewhat non-trivial task, and (more importantly) most current hardware just doesn't include such a thing.
Other than that, do some meaningful work in each thread, so when you execute in multiple threads, you have something that can actually use the resources of your multiple CPUs/cores to run faster.
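The per-thread counter suggestion above could look roughly like the following sketch. It assumes each task is a Callable that counts into a local variable and returns it, and the totals are added only after the tasks finish. The class and method names are hypothetical, and the timing calls are left out to keep the focus on removing the shared atomics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalCountersSketch {

    /** Runs `tasks` tasks of `perTask` iterations each and returns the total lookup count. */
    static long countLookups(int threads, int tasks, int perTask) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> work = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                work.add(() -> {
                    long lookups = 0; // local to this task, so no contention
                    for (int j = 0; j < perTask; j++) {
                        lookups++;
                    }
                    return lookups;
                });
            }
            long total = 0;
            for (Future<Long> result : executor.invokeAll(work)) {
                total += result.get(); // combine once, after the task has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // 334 tasks of 3000 iterations, as in the question
        System.out.println("lookups: " + countLookups(4, 334, 3000)); // prints "lookups: 1002000"
    }
}
```

Because no two tasks write to the same memory location, the only synchronization left is the hand-off of each result through its Future.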
I wrote an application which reads all lines in text files and measures the time taken. I'm wondering what the time for the whole block will be.
For example if I start 2 threads at the same time:
for (int i = 0; i < 2; i++) {
t[i] = new Threads(args[j], 2);
j++;
}
try {
Thread.sleep(500);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("TIME for block 1 of threads; "
+ (max(new long[]{t[0].getTime(),t[1].getTime()})));
I wait for them to finish processing the files and then read the operation times (via getTime). Is it correct to assume that the time for a block of threads is the maximum of the times reported by the individual threads? I think so, because every other thread will already have finished by the time the slowest one stops.
Or maybe should I think in another way?
It's dangerous to argue about execution order when having multiple threads! E.g. If you run your code on a single core CPU, the threads will not really run in parallel, but sequentially, so the total run time for both threads is the sum of each thread's run time, not the maximum of both.
Fortunately, there is a very easy way to just measure this if you use an ExecutorService instead of directly using Threads (btw. this is always good advice):
// 1. init executor
int numberOfThreads = 2; // or any other number
int numberOfTasks = numberOfThreads; // is this true in your case?
ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
long startTime = System.currentTimeMillis();
// 2. execute tasks in parallel using executor
for(int i = 0; i < numberOfTasks; i++) {
executor.execute(new Task()); // Task is your implementation of Runnable
}
// 3. initiate shutdown and wait until all tasks are finished
executor.shutdown();
executor.awaitTermination(1, TimeUnit.MINUTES); // we won't wait forever
// 4. measure time
long delta = System.currentTimeMillis() - startTime;
Now, delta holds the total running time of your tasks. You can play around with numberOfThreads to see if more or fewer threads give different results.
Important note: a Reader or InputStream is generally not safe for concurrent use, so do not share one between threads; give each task its own.
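A minimal sketch of the "one reader per task" rule, assuming each task is given its own file and simply counts lines. The names LineCountSketch, countLines, countAll, and tempFile are hypothetical, and the temporary files exist only to make the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LineCountSketch {

    /** Counts the lines of one file using a reader that only this call touches. */
    static long countLines(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts each file in its own task; results are in the same order as `files`. */
    static List<Long> countAll(List<Path> files, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<Long> task = () -> countLines(file);
                pending.add(executor.submit(task));
            }
            List<Long> counts = new ArrayList<>();
            for (Future<Long> result : pending) {
                counts.add(result.get());
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    /** Helper for the demo: writes the given lines to a fresh temporary file. */
    static Path tempFile(String... lines) {
        try {
            Path file = Files.createTempFile("sketch", ".txt");
            return Files.write(file, Arrays.asList(lines));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Path> files = Arrays.asList(tempFile("one", "two", "three"), tempFile("four", "five"));
        long start = System.nanoTime();
        List<Long> counts = countAll(files, 2);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(counts + " in " + elapsedMs + " ms");
    }
}
```

The elapsed time is measured around the whole batch, so it reflects the block of tasks rather than any single one.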
As far as I know, you can use the static timing methods of the System class.
Call one at the start of the block and again at the end, then subtract the earlier value from the later one.
They are:
System.currentTimeMillis(); // The current wall-clock time, in milliseconds.
or
System.nanoTime(); // The value of a high-resolution timer, in nanoseconds (meaningful only as a difference).
You can use them like this.
At the start of the block:
long startTime = System.currentTimeMillis();
At the end of the block:
long elapsed = System.currentTimeMillis() - startTime;
This gives you the elapsed time.
I am trying to understand the ExecutorService in java. There is not much performance difference when I use 1 thread or 4 threads. I have a quad core CPU and I do not have any other process running.
ExecutorService exService = Executors.newFixedThreadPool(4);
exService.execute(new Test().new RunnableThread());
exService.awaitTermination(25, TimeUnit.SECONDS);
class RunnableThread implements Runnable {
@Override
public void run() {
StopWatch stopWatch = new StopWatch();
stopWatch.start();
long cnt = 0;
for (cnt = 0; cnt < 999999999; cnt++) {
try {
for (long j = 0; j < 20; j++){
x += j;
}
} catch (Exception e) {
e.printStackTrace();
}
}
stopWatch.stop();
System.out.println(stopWatch.getTime());
}
}
If my understanding is right, my task should have close to 4x performance improvement when I say newFixedThreadPool(4) right?
Unfortunately, there is no magic in the allocation of workload to threads.
Every task runs on its own thread. It does not somehow automatically get transformed into concurrent execution paths.
If you have only one task, the remaining three threads will be idle.
Multiple threads only speed up things if you can split your workload into multiple tasks that can run concurrently (and you have to do that splitting yourself).
If my understanding is right, my task should have close to 4x performance improvement when I say newFixedThreadPool(4) right?
Yes, if you're actually running 4 concurrent tasks.
Currently, you have a single task that you are submitting to the executor. Let's say that it takes 10 seconds. Even if you have 4 cores and 4 threads, Java will not be able to parallelize a single task. However, if you submit 4 independent tasks (that have no memory or lock contention), then you will see all of them complete in those 10 seconds that it took the 1 task.
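One way to do the splitting described above, sketched under the assumption that the question's undeclared x can be a local accumulator returned from each task. The outer loop is cut into four ranges, each range is submitted as its own task, and the partial results are added at the end. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitWorkSketch {

    /** The loop from the question, restricted to outer iterations [from, to). */
    static long work(long from, long to) {
        long x = 0;
        for (long cnt = from; cnt < to; cnt++) {
            for (long j = 0; j < 20; j++) {
                x += j;
            }
        }
        return x;
    }

    /** Splits `iterations` into `parts` ranges, runs each as a task, and adds the results. */
    static long runSplit(long iterations, int parts) {
        ExecutorService executor = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            long chunk = iterations / parts;
            for (int p = 0; p < parts; p++) {
                long from = p * chunk;
                long to = (p == parts - 1) ? iterations : from + chunk; // last part takes the remainder
                Callable<Long> task = () -> work(from, to);
                partials.add(executor.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = runSplit(999_999_999L, 4);
        System.out.println(result + " in " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

With a pool of 4 and four tasks, every pool thread has something to do. With a single submitted task, three of them would sit idle.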
I'm writing an implementation of the conjugate-gradient method.
I use Java multithreading for matrix back-substitution.
Synchronization is done using CyclicBarrier and CountDownLatch.
Why does it take so much time to synchronize the threads?
Are there other ways to do it?
code snippet
private void syncThreads() {
// barrier.await();
try {
barrier.await();
} catch (InterruptedException e) {
} catch (BrokenBarrierException e) {
}
}
You need to ensure that each thread spends more time doing useful work than it costs in overhead to pass a task to another thread.
Here is an example of where the overhead of passing a task to another thread far outweighs the benefits of using multiple threads.
final double[] results = new double[10*1000*1000];
{
long start = System.nanoTime();
// using a plain loop.
for(int i=0;i<results.length;i++) {
results[i] = (double) i * i;
}
long time = System.nanoTime() - start;
System.out.printf("With one thread it took %.1f ns per square%n", (double) time / results.length);
}
{
ExecutorService ex = Executors.newFixedThreadPool(4);
long start = System.nanoTime();
// submitting one tiny task per element
for(int i=0;i<results.length;i++) {
final int i2 = i;
ex.execute(new Runnable() {
@Override
public void run() {
results[i2] = (double) i2 * i2; // cast first, as in the single-threaded loop, to avoid int overflow
}
});
}
ex.shutdown();
ex.awaitTermination(1, TimeUnit.MINUTES);
long time = System.nanoTime() - start;
System.out.printf("With four threads it took %.1f ns per square%n", (double) time / results.length);
}
prints
With one thread it took 1.4 ns per square
With four threads it took 715.6 ns per square
Using multiple threads is much worse.
However, increase the amount of work each thread does and the picture changes:
final double[] results = new double[10 * 1000 * 1000];
{
long start = System.nanoTime();
// using a plain loop.
for (int i = 0; i < results.length; i++) {
results[i] = Math.pow(i, 1.5);
}
long time = System.nanoTime() - start;
System.out.printf("With one thread it took %.1f ns per pow 1.5%n", (double) time / results.length);
}
{
int threads = 4;
ExecutorService ex = Executors.newFixedThreadPool(threads);
long start = System.nanoTime();
int blockSize = results.length / threads;
// one task per block of the array
for (int i = 0; i < threads; i++) {
final int istart = i * blockSize;
final int iend = (i + 1) * blockSize;
ex.execute(new Runnable() {
@Override
public void run() {
for (int i = istart; i < iend; i++)
results[i] = Math.pow(i, 1.5);
}
});
}
ex.shutdown();
ex.awaitTermination(1, TimeUnit.MINUTES);
long time = System.nanoTime() - start;
System.out.printf("With four threads it took %.1f ns per pow 1.5%n", (double) time / results.length);
}
prints
With one thread it took 287.6 ns per pow 1.5
With four threads it took 77.3 ns per pow 1.5
That's an almost 4x improvement.
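The question's back-substitution code is not shown, so the following is only a sketch of how the same principle might apply to a CyclicBarrier: each worker processes a whole block of the array per phase and calls await once per phase, rather than once per element. The names runPhases and PhaseBarrierSketch, and the "add 1.0" work, are assumptions made for the illustration.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class PhaseBarrierSketch {

    /** Each worker updates its own block in every phase; returns how many phases completed. */
    static int runPhases(int workers, int phases, double[] data) {
        AtomicInteger completed = new AtomicInteger();
        // The barrier action runs once per phase, after every worker has arrived.
        CyclicBarrier barrier = new CyclicBarrier(workers, completed::incrementAndGet);
        int block = data.length / workers;
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            int from = w * block;
            int to = (w == workers - 1) ? data.length : from + block;
            threads[w] = new Thread(() -> {
                try {
                    for (int phase = 0; phase < phases; phase++) {
                        for (int i = from; i < to; i++) {
                            data[i] += 1.0; // a whole block of work ...
                        }
                        barrier.await(); // ... per single synchronization
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (BrokenBarrierException e) {
                    throw new IllegalStateException("another worker failed", e);
                }
            });
            threads[w].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        double[] data = new double[10 * 1000 * 1000];
        long start = System.nanoTime();
        int phases = runPhases(4, 3, data);
        System.out.println(phases + " phases in " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

Unlike the snippet in the question, the exceptions from await are not swallowed, so a broken barrier shows up instead of silently producing wrong results.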
How many threads are being used in total? That is likely the source of your problem. Using multiple threads will only really give a performance boost if:
Each task in the thread does some sort of blocking. For example, waiting on I/O. Using multiple threads in this case enables that blocking time to be used by other threads.
or You have multiple cores. If you have 4 cores or 4 CPUs, you can do 4 tasks simultaneously (or 4 threads).
It sounds like you are not blocking in the threads, so my guess is you are using too many threads. If, for example, you are using 10 different threads to do the work at the same time but only have 2 cores, that would likely be much slower than running all of the tasks in sequence. Generally, start with a number of threads equal to your number of cores/CPUs, then increase it slowly, gauging the performance each time. This will give you the optimal thread count to use.
Perhaps you could try to re-implement your code using fork/join from JDK 7 and see what it does?
The default constructor creates a thread pool with as many threads as there are cores in your system. If you choose a reasonable threshold for dividing your work into smaller chunks, this will probably execute much more efficiently.
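A minimal fork/join sketch, assuming a simple range sum in place of the question's actual computation, which is not shown. The THRESHOLD value and the class names are assumptions. The task splits its range in half until a piece is small enough to sum directly.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSumSketch {

    /** Sums the integers in [from, to), splitting until a block is at most THRESHOLD long. */
    static class RangeSum extends RecursiveTask<Long> {
        static final int THRESHOLD = 100_000; // assumed cut-off; tune it by measurement
        final int from;
        final int to;

        RangeSum(int from, int to) {
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += i;
                }
                return sum;
            }
            int mid = from + (to - from) / 2;
            RangeSum left = new RangeSum(from, mid);
            RangeSum right = new RangeSum(mid, to);
            left.fork();                          // run the left half asynchronously
            return right.compute() + left.join(); // do the right half here, then combine
        }
    }

    /** Sums 0..n-1 on a pool sized to the number of available processors. */
    static long parallelSum(int n) {
        ForkJoinPool pool = new ForkJoinPool(); // default parallelism = available processors
        try {
            return pool.invoke(new RangeSum(0, n));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1_000_000)); // prints 499999500000
    }
}
```

A threshold that is too small brings back the per-task overhead problem, and one that is too large leaves cores idle.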
You are most likely aware of this, but in case you aren't, please read up on Amdahl's Law. It gives the relationship between expected speedup of a program by using parallelism and the sequential segments of the program.
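For reference, Amdahl's Law in its usual form, with a worked example. The values p = 0.9 and n = 4 are assumptions chosen for illustration, not measurements from the question.

```latex
% p = fraction of the run time that can be parallelized, n = number of threads/cores
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}

% Worked example with p = 0.9 and n = 4:
S(4) = \frac{1}{0.1 + 0.225} = \frac{1}{0.325} \approx 3.08

% Upper bound as n grows without limit:
\lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 10
```

So even with 90% of the work parallelized, four cores give roughly a 3x speedup rather than 4x, and no number of cores gives more than 10x.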
Synchronizing across cores is much slower than in a single-core environment. See if you can limit the JVM to 1 core (see this blog post),
or you can use an ExecutorService and use invokeAll to run the parallel tasks.