I have a big list (up to 500,000) of functions.
My task is to generate a graph for each function (this can be done independently of the other functions) and dump the output to a file (it can be several files).
The process of generating graphs can be time consuming.
I also have a server with 40 physical cores and 128 GB of RAM.
I have tried to implement parallel processing using Java Threads/ExecutorPool, but it does not seem to use all of the processor's resources.
On some inputs the program takes up to 25 hours to run, and only 10-15 cores are working according to htop.
So the second thing I tried was to create 40 distinct processes (using Runtime.exec) and split the list among them.
This method uses all of the processor's resources (100% load on all 40 cores) and speeds things up by about 5x on the previous example (it takes only 5 hours, which is reasonable for my task).
But the problem with this method is that each Java process runs separately and consumes memory independently of the others. In some scenarios all 128 GB of RAM is consumed after 5 minutes of parallel work. One solution that I am using now is to call System.gc() in each process if Runtime.totalMemory > 2GB. This slows down overall performance a bit (8 hours on the previous input) but keeps memory usage within reasonable bounds.
But this configuration only works for my server. If you run it on a server with 40 cores and 64 GB of RAM, you need to tune the Runtime.totalMemory > 2GB condition.
So the question is: what is the best way to avoid such aggressive memory consumption?
Is it normal practice to run separate processes to do parallel jobs?
Is there any other parallel method in Java (maybe fork/join?) which uses 100% of the processor's physical resources?
You don't need to call System.gc() explicitly! The JVM will do it automatically when needed, and almost always does it better. You should, however, set the max heap size (-Xmx) to a number that works well.
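As a concrete sketch of that advice applied to the 40-process setup described above: instead of calling System.gc() inside each worker when Runtime.totalMemory crosses a hand-tuned threshold, the launcher could pass an -Xmx derived from a configurable total budget to every child JVM, so the same code adapts to a 128 GB or a 64 GB server. This is an assumption about how the processes are launched; the GraphWorker class, the chunk-N.list file names and the total.heap.mb property are made up for the example.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical launcher sketch: cap every child JVM's heap with -Xmx instead of
    // calling System.gc(), so the OS-level memory budget holds on any server.
    public class WorkerLauncher {
        public static void main(String[] args) throws Exception {
            int workers = Runtime.getRuntime().availableProcessors();      // e.g. 40
            long totalHeapMb = Long.getLong("total.heap.mb", 100_000L);    // budget for all workers, configurable
            long heapPerWorkerMb = totalHeapMb / workers;                  // e.g. ~2.5 GB each on a 128 GB box

            List<Process> processes = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                ProcessBuilder pb = new ProcessBuilder(
                        "java",
                        "-Xmx" + heapPerWorkerMb + "m",
                        "GraphWorker",          // hypothetical worker main class
                        "chunk-" + i + ".list"  // hypothetical input split for this worker
                ).inheritIO();
                processes.add(pb.start());
            }
            for (Process p : processes) {
                p.waitFor();
            }
        }
    }

With a hard heap cap per child, each JVM collects on its own schedule as it approaches its limit, and the sum of the caps stays within the machine's RAM.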
If your program won't scale further, you have some kind of contention or bottleneck. You can either analyse your program and your Java and system settings to figure out why, or run it as multiple processes. If each process is multi-threaded, then you may get better performance using 5-10 processes instead of 40.
Note that you may get higher performance with more than one thread per core. Fiddle around with 1-8 threads per core and see if throughput increases.
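A minimal sketch of that experiment, assuming the real graph generation can be stood in for by a CPU-bound task: run the same batch of work at 1 to 8 threads per core and compare wall-clock times.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Sketch of the threads-per-core sweep: time the same batch of work at 1..8 threads per core.
    public class PoolSizeSweep {
        public static void main(String[] args) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            int tasks = 10_000;

            for (int perCore = 1; perCore <= 8; perCore++) {
                ExecutorService pool = Executors.newFixedThreadPool(cores * perCore);
                long start = System.nanoTime();
                for (int i = 0; i < tasks; i++) {
                    pool.submit(PoolSizeSweep::simulatedWork);   // stand-in for one graph generation
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%d threads/core -> %d ms%n", perCore, elapsedMs);
            }
        }

        private static void simulatedWork() {
            // Placeholder load; replace with the real graph computation.
            double x = 0;
            for (int i = 0; i < 1_000_000; i++) x += Math.sqrt(i);
            if (x < 0) System.out.println(x);   // keep the JIT from eliminating the loop
        }
    }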
From your description it sounds like you have 500,000 completely independent items of work and that each work item doesn't really need a lot of memory. If that is true, then memory consumption isn't really an issue. As long as each process has enough memory so it doesn't have to gc very often then gc isn't going to affect the total execution time by much. Just make sure you don't have any dangling references to objects you no longer need.
One of the problems here: it is still very hard to understand how many threads, cores, ... are actually available.
My personal suggestion: there are several articles in The Java Specialists' Newsletter which do a very deep dive into this subject.
For example this one: http://www.javaspecialists.eu/archive/Issue135.html
or a more recent one, on "the number of available processors": http://www.javaspecialists.eu/archive/Issue220.html
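As a quick illustration of why that matters, the snippet below prints what the JVM believes it has to work with; under taskset, cgroups or a container this can differ from the physical core count.

    // Quick check of what the JVM thinks is available on this machine.
    public class CpuCount {
        public static void main(String[] args) {
            System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        }
    }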
Related
I have a few questions. I have searched the net for the questions below but only got more confused.
1. I have a Java application running on a single machine with a single JVM: a 20-core machine with 20 GB of RAM and 20 Java threads. (single machine, single JVM, 20 cores, 20 GB RAM, 20 Java threads)
2. I have a Java application running on a single machine with 20 JVMs: a 20-core machine with 1 GB of RAM per JVM and 1 Java thread per process. (single machine, 20 JVMs, 20 cores, 20 GB RAM, 20 Java threads)
From points 1 and 2 above, which one is better, and why, in terms of:
Performance of application
CPU utilization of machine
If the CPU is not fully utilized even after increasing the number of threads, what can be done in the application or on the machine to utilize more CPU?
In which case will context switching be higher?
In points 1 and 2 above, let's say each thread takes 10 ms to complete its task; how long will it take for all the tasks to complete in case 1 and in case 2? Please explain.
I have searched and heard about CPU-intensive and I/O-intensive applications in regard to CPU utilization. Can you please explain a bit, as I got very confused reading about it on the net.
Generally it is recommended to have the number of threads equal to the number of cores. If I have more threads than cores, what will the impact be? Please explain.
You have a lot of different variables in play there and there isn't going to be a simple answer. I'll add some details below, but the tl;dr version is going to be "run some tests for your scenario and see what works the best".
Whether or not you should have one thread per cpu depends on your workload. As you've already apparently found, if your thread is cpu intensive (meaning, most of the time it is actively using the cpu), then one thread will come close to fully utilizing one cpu. In many real-world situations, though, it's likely that your thread may have to do other things (like wait for I/O), in which case it will not fully utilize a cpu (meaning you could run more than one thread per cpu).
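One common rule of thumb for that (an assumption on my part, not something stated above) is threads ≈ cores × (1 + wait time / compute time); a tiny sketch:

    // Rule-of-thumb pool sizing (an assumption, not a measurement):
    // threads ≈ cores * (1 + waitTime / computeTime)
    public class PoolSizing {
        static int suggestedThreads(int cores, double waitMs, double computeMs) {
            return (int) Math.max(1, Math.round(cores * (1 + waitMs / computeMs)));
        }

        public static void main(String[] args) {
            // Purely CPU-bound task: ~0 ms waiting -> one thread per core.
            System.out.println(suggestedThreads(20, 0, 10));    // 20
            // Task that waits 30 ms for I/O for every 10 ms of CPU -> 4 threads per core.
            System.out.println(suggestedThreads(20, 30, 10));   // 80
        }
    }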
Also, in your two scenarios, you need to account for all the JVM overhead. Each JVM will have many of its own threads (most notably, the GC threads). These will add additional overhead to your CPU usage. If your code makes heavy use of the garbage collector (creating/discarding a large amount of garbage while working), it may be beneficial to have separate JVMs per thread, but you'll need to account for the additional GC thread CPU usage. If your code does not make a lot of garbage, then using separate JVMs (and many extra GC threads) may just be wasting resources.
Given all these variables, as well as the actual workload profile of your threads, you are unlikely to find the right answer with theory alone. Testing a variety of scenarios with real world data is the only way you will find the right mix for your application (it might end up being something in the middle like 4 jvms with 5 threads each).
I have a very large set of text files. The task was to calculate the document frequencies (the number of documents that contain a certain term) for all the terms (uniquely) in this huge corpus. Simply starting from the first file and calculating everything in a serial manner seemed to be a dumb thing to do (I admit I did it just to see how disastrous it is).
I realized that if I do this calculation in a Map-Reduce manner, meaning clustering my data into smaller pieces and in the end aggregating the results, I would get the results much faster.
My PC has 4 cores, so I decided to separate my data into 3 distinct subsets, feed each subset to a separate thread, wait for all the threads to finish their work, and pass their results to another method to aggregate everything.
I tested it with a very small set of data and it worked fine. Before using the actual data, I tested it with a larger set so I could study its behaviour better. I started jvisualvm and htop to see how the CPU and memory are being used. I can see that 3 threads are running and the CPU cores are also busy, but the usage of these cores is rarely above 50%. This means that my application is not really using the full power of my PC. Is this related to my code, or is this how it is supposed to be? My expectation was that each thread would use as much CPU core resource as possible.
I use Ubuntu.
Sounds to me like you have an I/O-bound application. You are spending more time in your individual threads reading the data from disk than actually processing the information that is read.
You can test this by migrating your program to another system with a SSD to see if the CPU performance changes. You can also read in all of the files and then process them later to see if that changes the CPU curve during processing time. I suspect it will.
As already stated, you're bottlenecked by something, probably disk I/O. Try separating the code that reads from disk from the code that processes the data, and use separate thread pools for each. Afterwards, a good way to quickly scale your thread pools to properly fit your resources is to use one of the Executors thread pools.
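A rough sketch of that split, with placeholder file paths and a placeholder countTerms step standing in for the real term counting:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Sketch of splitting disk reads from CPU work with two pools
    // (paths and countTerms are placeholders for the real corpus and counting logic).
    public class ReaderProcessorPools {
        public static void main(String[] args) throws Exception {
            ExecutorService ioPool  = Executors.newFixedThreadPool(2);   // few threads: the disk is the limit
            ExecutorService cpuPool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());         // one per core for the counting

            List<Path> files = List.of(Paths.get("doc1.txt"), Paths.get("doc2.txt")); // placeholder corpus

            for (Path file : files) {
                ioPool.submit(() -> {
                    try {
                        List<String> lines = Files.readAllLines(file);   // I/O happens here
                        cpuPool.submit(() -> countTerms(lines));         // hand off to the CPU pool
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
            }

            ioPool.shutdown();
            ioPool.awaitTermination(1, TimeUnit.HOURS);
            cpuPool.shutdown();
            cpuPool.awaitTermination(1, TimeUnit.HOURS);
        }

        private static void countTerms(List<String> lines) {
            // Placeholder for the per-document term-frequency work.
        }
    }

The I/O pool stays small because the disk can only serve so many readers at once, while the CPU pool is sized to the cores.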
You are I/O bound for a problem like this on a single machine, not CPU bound. Are you actively reading the files? Only if you had all the files in memory would you start to saturate the CPU. That is why map-reduce is effective: it scales total I/O throughput more than CPU.
You could possibly speed this up quite a bit if you are on Linux by using tmpfs to store the data in memory instead of on disk.
I have a program which runs tasks in parallel all day long (no I/O in the tasks to be executed), so I have used Executors.newFixedThreadPool(poolSize) to implement it.
Initially I set the poolSize to Runtime.getRuntime().availableProcessors(), but I was a bit worried to use all the available cores since there are other processes running on the same PC (32 cores).
In particular, I have ten other JVMs running the same program (on different input data), so I'm a bit worried that there might be a lot of overhead in terms of threads switching amongst the available cores, which could slow down the overall calculations.
How shall I decide the size of the pool for each program / JVM?
Also, in my PC, there are other processes running all the time (Antivirus, Backup, etc.). Shall I take into account these as well?
Any advice is going to be dependent upon your particular circumstances. 10 JVMs on 32 cores would suggest 3 threads each (ignoring garbage collection threads, timer tasks etc...)
You also have other tasks running. The scheduler will ensure they're running, but do they have to be responsive? More responsive than the JVM? If you're running Linux/Unix then you can also make use of prioritisation (via nice) to ensure particular processes don't hog the CPU.
Finally you're running 10 JVMs. Will that cause paging ? If so, that will be slow and you may be better off running fewer JVMs in order to avoid consuming so much memory.
Just make sure that your key variables are exposed and configurable, and measure various scenarios in order to find the optimal one.
How shall I decide the size of the pool for each program / JVM?
You want the number of threads which will get you close to 99% utilisation and no more.
The simplest way to balance the work is to have the process run once, processing multiple files concurrently and using just one thread pool. You can set your process up as a service if you need to start files via the command line.
If this is impossible for some reason, you will need to guesstimate how much the thread pools should be shrunk. Try running one process and look at the utilisation. If one is, say, 40%, then I suspect ten processes would be over-utilised by 400%, i.e. you might reduce the pool size by a factor of 4.
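For reference, a minimal sketch of that single-process layout, with a placeholder input directory and a placeholder handleFile method standing in for the real work:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Stream;

    // Sketch of "one process, one pool": a single fixed pool sized to the machine
    // works through all files, instead of ten JVMs competing for the same cores.
    public class SingleProcessRunner {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());

            try (Stream<Path> files = Files.list(Paths.get("input-dir"))) {  // placeholder input directory
                files.forEach(file -> pool.submit(() -> handleFile(file)));
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }

        private static void handleFile(Path file) {
            // Placeholder for the real per-file calculation.
        }
    }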
Unfortunately, this is a hard thing to know, as programs don't typically know what else is or might be going on on the same box.
the "easy" way out is to make the pool size configurable. this allows the user who controls the program/box to decide how many threads to allocate to your program (presumably using their knowledge of the general workload of the box).
a more complex solution would be to attempt to programmatically determine the current workload of the box and choose the pool size appropriately from that. the efficacy of this solution depends on how accurately you can determine the workload and potentially adapt as it changes over time.
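A small sketch of the configurable-pool-size idea; the worker.threads property name is invented for the example, and availableProcessors() is just the default:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of a configurable pool size: whoever runs the box can override the default
    // with -Dworker.threads=N (property name is made up for this example).
    public class ConfigurablePool {
        public static void main(String[] args) {
            int defaultThreads = Runtime.getRuntime().availableProcessors();
            int threads = Integer.getInteger("worker.threads", defaultThreads);

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            System.out.println("Running with " + threads + " worker threads");
            // ... submit work here, then shut the pool down ...
            pool.shutdown();
        }
    }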
Try grepping the processes, and check top/Task Manager and performance monitors to verify whether this implementation is actually affecting your machine.
This article seems to contain interesting info about what you are trying to implement:
http://www.ibm.com/developerworks/library/j-jtp0730/index.html
One of our servers is experiencing a very high CPU load with our application. We've looked at various stats and are having issues finding the source of the problem.
One of the current theories is that there are too many threads involved and that we should try to reduce the number of concurrently executing threads. There's just one main thread pool, with 3000 threads, and a WorkManager working with it (this is Java EE - Glassfish). At any given moment, there are about 620 separate network IO operations that need to be conducted in parallel (use of java.NIO is not an option either). Moreover, there are roughly 100 operations that have no IO involved and are also executed in parallel.
This structure is not efficient and we want to see if it is actually causing damage, or is simply bad practice. Reason being that any change is quite expensive in this system (in terms of man hours) so we need some proof of an issue.
So now we're wondering if context switching of threads is the cause, given there are far more threads than the required concurrent operations. Looking at the logs, we see that on average there are 14 different threads executed in a given second. If we take into account the existence of two CPUs (see below), then it is 7 threads per CPU. This doesn't sound like too much, but we wanted to verify this.
So - can we rule out context switching or too-many-threads as the problem?
General Details:
Java 1.5 (yes, it's old), running on CentOS 5, 64-bit, Linux kernel 2.6.18-128.el5
There is only one single Java process on the machine, nothing else.
Two CPUs, under VMware.
8GB RAM
We don't have the option of running a profiler on the machine.
We don't have the option of upgrading the Java, nor the OS.
UPDATE
As advised below, we've conducted captures of load average (using uptime) and CPU (using vmstat 1 120) on our test server with various loads. We've waited 15 minutes between each load change and its measurements to ensure that the system stabilized around the new load and that the load average numbers are updated:
50% of the production server's workload: http://pastebin.com/GE2kGLkk
34% of the production server's workload: http://pastebin.com/V2PWq8CG
25% of the production server's workload: http://pastebin.com/0pxxK0Fu
CPU usage appears to be reduced as the load reduces, but not on a very drastic level (change from 50% to 25% is not really a 50% reduction in CPU usage). Load average seems uncorrelated with the amount of workload.
There's also a question: given our test server is also a VM, could its CPU measurements be impacted by other VMs running on the same host (making the above measurements useless)?
UPDATE 2
Attaching the snapshot of the threads in three parts (pastebin limitations)
Part 1: http://pastebin.com/DvNzkB5z
Part 2: http://pastebin.com/72sC00rc
Part 3: http://pastebin.com/YTG9hgF5
Seems to me the problem is 100 CPU bound threads more than anything else. 3000 thread pool is basically a red herring, as idle threads don't consume much of anything. The I/O threads are likely sleeping "most" of the time, since I/O is measured on a geologic time scale in terms of computer operations.
You don't mention what the 100 CPU threads are doing, or how long they last, but if you want to slow down a computer, dedicating 100 threads of "run until time slice says stop" will most certainly do it. Because you have 100 "always ready to run", the machine will context switch as fast as the scheduler allows. There will be pretty much zero idle time. Context switching will have impact because you're doing it so often. Since the CPU threads are (likely) consuming most of the CPU time, your I/O "bound" threads are going to be waiting in the run queue longer than they're waiting for I/O. So, even more processes are waiting (the I/O processes just bail out more often as they hit an I/O barrier quickly which idles the process out for the next one).
No doubt there are tweaks here and there to improve efficiency, but 100 CPU threads are 100 CPU threads. Not much you can do there.
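If it were possible to wrap those 100 CPU-bound operations as tasks rather than dedicated threads, one such tweak (speculative, since I don't know your code) would be to push them through a pool sized to the core count so they queue instead of all being runnable at once:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: run the 100 CPU-bound operations through a core-sized pool
    // so only a few are runnable at any moment, reducing context switching.
    public class CpuBoundQueue {
        public static void main(String[] args) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();   // 2 on the VM described above
            ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

            List<Runnable> operations = buildOperations();            // placeholder for the 100 CPU-bound tasks
            for (Runnable op : operations) {
                cpuPool.submit(op);
            }
            cpuPool.shutdown();
            cpuPool.awaitTermination(1, TimeUnit.HOURS);
        }

        private static List<Runnable> buildOperations() {
            List<Runnable> ops = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                ops.add(() -> { /* placeholder CPU-bound work */ });
            }
            return ops;
        }
    }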
I think your constraints are unreasonable. Basically what you are saying is:
1. I can't change anything
2. I can't measure anything
Can you please speculate as to what my problem might be?
The real answer to this is that you need to hook a proper profiler to the application and you need to correlate what you see with CPU usage, Disk/Network I/O, and memory.
Remember the 80/20 rule of performance tuning: 80% of the improvement will come from tuning your application. You might just have too much load for one VM instance, and it could be time to consider solutions for scaling horizontally or vertically by giving more resources to the machine. It could also be that any one of the 3 billion JVM settings is out of line with your application's execution specifics.
I assume the 3000-thread pool came from the famous "more threads = more concurrency = more performance" theory. The real answer is that a tuning change isn't worth anything unless you measure throughput and response time before and after the change and compare the results.
If you can't profile, I'd recommend taking a thread dump or two and seeing what your threads are doing. Your app doesn't have to stop to do it:
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/threads.html
http://java.net/projects/tda/
http://java.sys-con.com/node/1611555
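Besides those tools, a thread dump can also be taken from inside the running application; this sketch uses Thread.getAllStackTraces(), which is available on Java 5, so it should work on the setup described:

    import java.util.Map;

    // Sketch: a thread dump taken from inside the running app (Java 5+),
    // as an alternative to the external tools linked above.
    public class InProcessThreadDump {
        public static void dump() {
            for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
                Thread t = entry.getKey();
                System.out.println(t.getName() + " [" + t.getState() + "]");
                for (StackTraceElement frame : entry.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }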
So - can we rule out context switching or too-many-threads as the problem?
I think your concerns over thrashing are warranted. A thread pool with 3000 threads (700+ concurrent operations) on a 2-CPU VMware instance certainly seems like a problem that may be causing context-switching overload and performance problems. Limiting the number of threads could give you a performance boost, although determining the right number is going to be difficult and will probably take a lot of trial and error.
we need some proof of an issue.
I'm not sure the best way to answer but here are some ideas:
Watch the load average of the VM OS and the JVM. If you are seeing high load values (20+) then this is an indicator that there are too many things in the run queues.
Is there no way to simulate the load in a test environment so you can play with the thread pool numbers? If you run simulated load in a test environment with pool size of X and then run with X/2, you should be able to determine optimal values.
Can you compare high load times of day with lower load times of day? Can you graph number of responses to latency during these times to see if you can see a tipping point in terms of thrashing?
If you can simulate load then make sure you aren't just testing under the "drink from the fire hose" methodology. You need simulated load that you can dial up and down. Start at 10% and slowly increase the simulated load while watching throughput and latency. You should be able to see the tipping points by watching for throughput flattening or otherwise deflecting.
Usually, context switching in threads is very cheap computationally, but when it involves this many threads... you just can't know. You say upgrading to Java 1.6 EE is out of the question, but what about some hardware upgrades ? It would probably provide a quick fix and shouldn't be that expensive...
e.g. run a profiler on a similar machine.
try a newer version of Java 6 or 7. (It may not make a difference, in which case don't bother upgrading production)
try Centos 6.x
try not using VMware.
try reducing the number of threads. You only have 8 cores.
You may find all or none of the above options make a difference, but you won't know until you have a system you can test on with a known/repeatable workload.
I wrote a very simple single-threaded Java application that simply iterates (a few times) over a list of Integers and calculates the sum. When I run this on my Linux machine (Intel X5677 3.46GHz quad-core), it takes the program about 5 seconds to finish. Same time if I restrict the JVM to two specific cores using taskset (which was quite expected, as the application is single-threaded and the CPU load is < 0.1% on all cores).
However, when I restrict the JVM to a single core, the program suddenly executes extremely slowly and takes 350+ seconds to finish. I could understand it being only marginally slower when restricted to a single core, as the JVM runs a few other threads in addition to the main thread, but I can't understand this extreme difference. I ran the same program on an old laptop with a single core, and it executes in about 15 seconds.
Does anyone understand what is going on here, or has anyone successfully restricted a JVM to a single core on a multicore system without experiencing something like this?
Btw, I tried this with both hotspot 1.6.0_26-b03 and 1.7.0-b147 – same problem.
Many thanks
Yes, this seems counter-intuitive, but the simple solution would be to not do it. Let the JVM use 2 cores.
FWIW, my theory is that the JVM is looking at the number of cores that the operating system is reporting, assuming that it will be able to use all of them, and tuning itself based on that assumption. But the fact that you've pinned the JVM to a single core is making that tuning pessimal.
One possibility is that the JVM has turned on spin-locking. That is a strategy where a thread that can't immediately acquire a lock will "spin" (repeatedly testing the lock) for a period, rather than immediately rescheduling. This can work well if you've got multiple cores and the locks are held for a short time, but if there is only one core available then spinlocking is an anti-optimization.
(If this is the real cause of the problem, I believe there is a JVM option you can set to turn off spinlocks.)
This would be normal behaviour if you have two or more threads with an interdependence on each other. Imagine a program where two threads ping-pong messages or data between them. When they are both running, this can take 10-100 ns per ping-pong. When they have to context switch to run, each exchange can take 10-100 microseconds. A 1000x increase wouldn't be surprising.
If you want to limit the program to one core, you may have to re-write portions of it so it's designed to run on one core efficiently.
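To make the ping-pong effect concrete, here is a small self-contained sketch (my own construction, not the OP's program) that bounces a token between two threads through SynchronousQueues; running it pinned to one core (e.g. taskset -c 0 java PingPong) versus two cores shows the cost of forcing a context switch on every hand-off:

    import java.util.concurrent.SynchronousQueue;

    // Sketch of the ping-pong pattern described above: two threads hand a token back and forth.
    // On one core every hand-off forces a context switch; on two cores both sides can stay runnable.
    public class PingPong {
        public static void main(String[] args) throws InterruptedException {
            final SynchronousQueue<Integer> ping = new SynchronousQueue<>();
            final SynchronousQueue<Integer> pong = new SynchronousQueue<>();
            final int rounds = 100_000;

            Thread partner = new Thread(() -> {
                try {
                    for (int i = 0; i < rounds; i++) {
                        pong.put(ping.take());      // bounce the token straight back
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            partner.start();

            long start = System.nanoTime();
            for (int i = 0; i < rounds; i++) {
                ping.put(i);
                pong.take();
            }
            partner.join();
            System.out.printf("%.0f ns per round trip%n", (System.nanoTime() - start) / (double) rounds);
        }
    }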