I have a Java program that calculates the semantic similarity between two documents. The program retrieves documents from a specified file system and calculates their similarity. There are around 200,000 such documents.
I have created 10 threads for this task and assigned a data chunk to each thread. For example, documents 1-20000 go to the first thread, 20001-40000 to the next thread, and so on.
Currently I am running the above program on an 8-core machine. It is taking a lot of time to finish the calculations.
I want to run the program on a 5-node Linux cluster where each node has 64 cores.
Are there any frameworks available in Java, like the Executor framework, that can do this task?
Is there a way to calculate the maximum number of threads that one can spawn?
Any pointers on how to resolve this or do it better would be appreciated.
Are there any frameworks available in Java, like the Executor framework, that can do this task?
I suggest you take a look at the Akka framework for writing powerful concurrent and distributed applications. Akka uses the Actor Model together with Software Transactional Memory to raise the abstraction level and provide a better platform for building correct, concurrent, and scalable applications.
Take a look at the step-by-step tutorial, which gives more information about how to build a distributed application using the Akka framework.
In general, distributed applications are built in Java using Java RMI, which internally uses Java's built-in serialization to pass objects between nodes.
Is there a way to calculate the maximum number of threads that one can spawn?
The simple rule we use is: set the thread count higher than the number of available logical cores in the system. How much higher depends on the type of operations you perform. For example, if the computation involves I/O, set the number of threads to 2 * the available logical cores (not the physical cores); a small sizing sketch follows the list below.
Other ideas that we use:
Measure CPU utilization while increasing the number of threads one by one, and stop when CPU utilization gets close to 90-100%.
Measure throughput and stop at the point where throughput plateaus or starts to degrade.
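A minimal sketch of that sizing rule, assuming an I/O-heavy document-loading workload (the 2x factor is the heuristic above, not a hard rule):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class PoolSizing {
    public static void main(String[] args) {
        // availableProcessors() reports logical cores (including hyper-threaded ones)
        int logicalCores = Runtime.getRuntime().availableProcessors();
        boolean ioBound = true; // assumption: loading documents waits on disk/network
        int threads = ioBound ? 2 * logicalCores : logicalCores;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // submit the similarity tasks here, then shut the pool down
        pool.shutdown();
    }
}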
Java's Fork/Join framework is your friend. As the opening statement of this framework says:
The fork/join framework is an implementation of the ExecutorService interface that helps you take advantage of multiple processors. It is designed for work that can be broken into smaller pieces recursively. The goal is to use all the available processing power to enhance the performance of your application.
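As a rough illustration of how the document-similarity job from the question could be expressed with fork/join; the range bounds, threshold, and computeSimilarity helper are hypothetical placeholders:
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
// Splits a range of document indices until chunks are small enough to process directly.
class SimilarityTask extends RecursiveTask<Double> {
    private static final int THRESHOLD = 1_000; // tune experimentally
    private final int start, end;
    SimilarityTask(int start, int end) { this.start = start; this.end = end; }
    @Override
    protected Double compute() {
        if (end - start <= THRESHOLD) {
            double sum = 0;
            for (int i = start; i < end; i++) {
                sum += computeSimilarity(i); // hypothetical per-document work
            }
            return sum;
        }
        int mid = (start + end) / 2;
        SimilarityTask left = new SimilarityTask(start, mid);
        SimilarityTask right = new SimilarityTask(mid, end);
        left.fork();                          // process the left half asynchronously
        return right.compute() + left.join(); // process the right half, then combine
    }
    private double computeSimilarity(int docIndex) {
        return 0.0; // placeholder for the real similarity calculation
    }
}
// Usage: new ForkJoinPool().invoke(new SimilarityTask(0, 200_000));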
On how many threads you can spawn: I think there is no hard and fast rule; it depends. You can try starting with a number like 5 or so, and then keep increasing or decreasing it based on the results.
Also, you can analyze your existing maximum and minimum thread counts, contrast them with CPU utilization and so on, and proceed this way to understand how your system behaves. If your application is deployed in an application server, check its threading model and what it says about thread capacity.
I understand roughly the difference between parallel computing and concurrent computing. Please correct me if I am wrong.
Parallel Computing
A system is said to be parallel if it can support two or more actions executing simultaneously. In parallel programming, efficiency is the major concern.
Concurrent Computing
A system is said to be concurrent if it can support two or more actions in progress at the same time. However, the actions do not necessarily have to execute simultaneously in concurrent programming. In concurrent programming, modularity, responsiveness, and maintainability are important.
I am wondering what is going to happen if I execute parallel programming code inside a multi-threaded program? e.g. using Java's parallel Stream in a multi-threaded server program.
Would the program actually be more efficient?
My initial thought is that it might not be a good idea, since a reasonably optimized multi-threaded program should already keep its threads occupied; parallelism here may only add extra overhead.
The crucial difference between concurrency and parallelism is that concurrency is about dealing with a lot of things at the same time (it gives the illusion of simultaneity), or handling concurrent events, essentially hiding latency. Parallelism, on the contrary, is about doing a lot of things at the same time to increase speed.
The two have different requirements and use cases.
Parallelism is used to achieve runtime performance and efficiency. Yes, by its nature it adds some overhead to the system (CPU, RAM, etc.), but it is a heavily used concept on today's multi-core hardware.
I am wondering what is going to happen if I execute parallel programming code inside a multi-threaded program? e.g. using Java's parallel Stream in a multi-threaded server program.
Based on my limited knowledge of the Java runtime, every program is already multithreaded: the application entry point is the main thread, which runs alongside other runtime threads (e.g. the garbage collector).
Suppose your application spawns two threads, and in one of those threads a parallelStream is created. It looks like the parallel streams API uses ForkJoinPool.commonPool, which by default starts roughly availableProcessors() - 1 worker threads. At this point your application may have more threads than CPUs, so if your parallelStream computation is CPU bound then you're already oversubscribed on threads vs. CPUs.
https://stackoverflow.com/a/21172732/594589
I'm not familiar with Java, but it's interesting that parallelStream shares the same thread pool. So if your program spawned another thread and started another parallelStream, the second parallelStream would share the underlying thread-pool threads with the first!
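A small sketch of that sharing: both application threads below run parallel streams, and both draw workers from the same ForkJoinPool.commonPool (whose default parallelism is roughly availableProcessors() - 1):
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;
public class SharedCommonPool {
    public static void main(String[] args) throws InterruptedException {
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        // Each application thread starts its own parallel stream.
        Runnable work = () -> IntStream.range(0, 1_000).parallel()
                .forEach(i -> System.out.println(Thread.currentThread().getName() + " -> " + i));
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        // The printed worker names (ForkJoinPool.commonPool-worker-N) show both
        // streams competing for the same pool of threads.
    }
}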
In my experience I've found it's important to consider:
The type of workload your application is performing (CPU vs. IO)
The type of concurrency primitives available (threads, processes, green threads, epoll, asyncio, etc.)
Your system resources (i.e. the number of CPUs available)
How your application's concurrency primitives map to the underlying OS resources
The number of concurrency primitives that your application has at any given time
Would the program actually be more efficient?
It completely depends, and the only sure answer is to benchmark the two solutions on your target architecture/system.
In my experience, reasoning about complex concurrency beyond basic patterns becomes largely a shot in the dark. I believe that this is where the saying:
Make it work, make it right, make it fast.
-- Kent Beck
comes from. In this case make sure that your program is concurrent safe (make it right) and free of deadlocks. And then begin testing, benchmarking and running experiments.
In my limited personal experience I have found analysis to largely fall apart beyond characterizing your application's workload (CPU vs. IO) and finding a way to model it so you can scale out to utilize your system's full resources in a configurable, benchmarkable way.
I am trying to use executor services in one of my applications, where I have created a pool of 8 since my machine has 4 cores and, as per my recent searches, I found that only 2 active threads can work on a core.
When I checked the number of cores via Java, I also found the value to be 4:
int cores = Runtime.getRuntime().availableProcessors(); // logical cores available to the JVM
ExecutorService executor = Executors.newFixedThreadPool(cores * 2);
Please suggest whether I am doing this correctly, because I don't see much worth in creating a pool of 500 when my CPU can handle only 8 threads.
Hyper-threading
You should read up on hyper-threading technology. That term is specifically the brand-name used by Intel for its proprietary simultaneous multithreading (SMT) implementation, but is also used more generally for SMT.
An over-simplified explanation:
In a conventional CPU, switching between threads is quite expensive. Registers are the holding place for several pieces of data actually being worked on by the CPU core. The values in those registers have to be swapped out when switching threads. Likewise caches (holding places for data, slower than registers but faster than RAM) may also be cleared. All this takes time.
Hyper-threading design in a CPU adds a duplicate set of registers. Each core having a double set of registers means it can switch between threads without swapping out the values in the registers. The switch between threads is much faster, so much so that the CPU lies to the operating system, reporting each core as a pair of (virtual) cores. So a 4-core chip will appear as 8 cores, for example.
I found that only 2 active threads can work on a core
Be aware that switching threads still has some expense, just a much lower expense. A hyper-threaded CPU core is still executing only one thread at a time. Being hyper-threaded means the switching between threads is easier & faster.
For use on a machine where the threads are often in a holding pattern, waiting for some external function to complete such as a call out over the network, hyper-threading makes much sense. For applications where the cores are likely to be doing CPU-bound work such as number-crunching, simulations, or scientific data analysis, hyper-threading may not be as useful. So on machines doing such work, the sysadmin may decide to disable hyper-threading, so that 4 cores are really just 4 cores, for example. Also, because of the recent security vulnerabilities related to hyper-threading technology, some sysadmins may decide to disable it.
Thread pools
creating a pool of 500 when my cpu can handle only 8 threads.
The sizing of a thread pool depends on the behavior of your application(s). If you have CPU-bound apps, then you certainly want to limit the number of such CPU-intensive threads to less than the number of actual or virtual cores. If your apps are not CPU-bound, if they are often doing file I/O, network I/O, or other activities where they often do nothing while waiting on other resources, then you can have more threads than cores. If your threads often sit idle doing nothing at all, then you can have even more threads going.
What is the best way to create thread pool in java
There are no specific rules to help you here. You must make an educated guess initially, then monitor your app and host machine in production. And so you may want to have a way to set the number of threads being used in your apps during runtime rather than hard-coding a number. For example, use preference settings or use JMX. Learn to use profiling tools such as Java Flight Recorder and Mission Control; both are now bundled with OpenJDK-based distributions of Java. If you are deploying to a system supporting DTrace (macOS, BSD, etc.), that may help as well.
Within an app with different kinds of workloads going on in various parts of functionality, it may make sense to maintain multiple thread pools. Use a pool with a very small number of threads for the CPU-intensive work, and a pool with a larger number of threads for CPU-non-intensive work. The Executors framework within modern Java helps make this easy.
Take into account all your apps you may be deploying to a machine. And take account of all the other apps running on that machine. And take account of the CPU needs of the operating system. After all this, you may find that some of your thread pools should be set to only one or two threads at most.
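A hedged sketch of that idea: two separately sized pools, with sizes read from system properties so they can be tuned at deployment time instead of hard-coded (the property names app.cpuPoolSize and app.ioPoolSize are made up for illustration, e.g. set with -Dapp.ioPoolSize=100):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class Pools {
    // Small pool for CPU-intensive work, bounded by the core count by default.
    static final ExecutorService CPU_POOL = Executors.newFixedThreadPool(
            Integer.getInteger("app.cpuPoolSize",
                    Runtime.getRuntime().availableProcessors()));
    // Larger pool for I/O-heavy work that spends most of its time waiting.
    static final ExecutorService IO_POOL = Executors.newFixedThreadPool(
            Integer.getInteger("app.ioPoolSize", 50));
}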
Tricky stuff
Thread-safety is very tricky complicated work. When sharing resources between threads (variables, files, etc.) you must educate yourself about the issues involved in protecting those resources from abuse.
Required reading: Java Concurrency in Practice by Brian Goetz et al.
Go to System Properties and check how many cores (physical cores) and logical processors (virtual cores) there are.
For example:
If your system has n cores and n logical processors, your processor does not support hyper-threading.
If your system has n cores and n * 2 logical processors, your processor does support hyper-threading, and you can execute n * 2 threads in parallel.
Note: suppose you have hyper-threading support, so you have 8 cores and 16 virtual cores.
Then the processor will give good throughput up to 16 threads. If you increase the thread pool beyond 16 threads, throughput will plateau and will not change much.
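For reference, Runtime.availableProcessors() reports logical processors, so on the 8-core hyper-threaded machine in the note it would typically print 16:
public class CoreCount {
    public static void main(String[] args) {
        // On an 8-core CPU with hyper-threading enabled this typically prints 16.
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}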
Concurrency in Java and similar languages is achieved through threads, i.e. task-level parallelism. But under the hood, does the hardware or the runtime also use instruction-level parallelism (ILP) to achieve the best performance?
A little further elaboration: in a multi-core processor (say 4 cores per system) with multiple hardware threads (say 2 per core, i.e. 8 hardware threads per system in total), a Java thread is executed on one of the several (8 in this case) hardware threads. But if the system determines that all or several of the other threads are doing nothing but sitting idle, can the hardware or the runtime do any legal re-orderings and execute work on other threads on the same or other cores, and fetch the results back (or into main memory)?
I am wondering whether the Java implementation allows this, or whether it is up to the hardware to handle it independently, without the JVM even knowing anything about it.
It's a little unclear what you're asking, but I don't think it has much to do with Java.
I think you're talking about (at least) two different things:
"ILP" is generally used to refer to a set of techniques that occur within a single core (such as pipelining and branch prediction), and has little to do with threading or multi-core. These techniques are transparent implementation details of the CPU, and typically not exposed in a way that you (or the runtime) can interact with directly.
Threads are swapped on and off cores by the kernel scheduler if they become blocked (and even if they're not, to ensure fairness).
I have been seeing how changing the number of processors used in a program that uses Java's ForkJoin for parallelization affects other processes going on with my computer (such as web browsing, typing a paper, opening other files). I had thought before that using fewer processors would leave more processing power for other tasks that my computer is doing, but I do not notice any difference in these tasks when I use one processor versus the maximum number of processors. Why would one ever want to use one processor rather than the maximum number of processors?
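For context, a hedged sketch of how the processor count is usually varied in such experiments: a dedicated ForkJoinPool with an explicit parallelism instead of the common pool (the summation is just a placeholder workload, and running a parallel stream inside a custom pool relies on widely used but undocumented behavior):
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;
public class LimitedParallelism {
    public static void main(String[] args) {
        // A private pool with parallelism 2 leaves the remaining cores free
        // for other processes, unlike the common pool, which uses most of them.
        ForkJoinPool pool = new ForkJoinPool(2);
        long sum = pool.submit(() -> LongStream.range(0, 1_000_000).parallel().sum()).join();
        System.out.println(sum);
        pool.shutdown();
    }
}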
I'm working on an application which interacts with hundreds of devices across a network. The type of work being done requires a lot of concurrent threads (mostly because each device requires its own network interaction, but for other reasons as well). At the moment, we're in the area of requiring about 20-30 threads per device being interacted with.
A simple calculation puts this at thousands of threads, even up to 10,000 threads. If we put aside the CPU penalty for thread switching, etc., how many threads can Java 5 running on 64-bit CentOS handle? Is this just a matter of RAM, or is there anything else we should consider?
Thanks!
In such situations it is always recommended to use thread pooling.
Thread pools address two different problems: they usually provide improved performance when executing large numbers of asynchronous tasks, due to reduced per-task invocation overhead, and they provide a means of bounding and managing the resources, including threads, consumed when executing a collection of tasks. Each ThreadPoolExecutor also maintains some basic statistics, such as the number of completed tasks.
ThreadPoolExecutor is the class you should be using.
http://www.javamex.com/tutorials/threads/ThreadPoolExecutor.shtml
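A minimal sketch of a bounded pool for the device tasks described in the question (the pool size of 200 and the loop bound are arbitrary illustrations, not recommendations):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class DevicePool {
    public static void main(String[] args) throws InterruptedException {
        // A fixed pool bounds the number of live threads; queued tasks wait
        // instead of each getting a thread of their own.
        ExecutorService pool = Executors.newFixedThreadPool(200);
        for (int device = 0; device < 1_000; device++) {
            final int id = device;
            pool.submit(() -> {
                // hypothetical per-device network interaction goes here
                System.out.println("talking to device " + id);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}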
I think up to 65k threads is OK with Java; the only thing you need to consider is stack space. Linux by default allocates 48k per thread/process as stack space, which is wasteful for Java (which doesn't have stack-allocated objects and hence uses much less stack space). This will easily use 500 MB for 10k threads.
If this is really an absolute requirement, you might want to have a look at a language that's specifically built to deal with this level of concurrency, such as Erlang.
Like others are suggesting, you should use NIO. We had an app that used a lot of threads (though far fewer than you are planning, e.g. 1,000), and it was already very inefficient. If you have to use that many threads, it's definitely time to consider using NIO.
For networking, if your apps use HTTP, one very easy tool would be Async-HTTP-client, by two very well-known authors in this field.
If you use a different protocol, using the underlying implementation of Async-HTTP-client (Netty) would be recommended.
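A hedged sketch of the Async-HTTP-client approach, assuming the 2.x org.asynchttpclient API (method names may differ between versions, and the device URL is made up):
import org.asynchttpclient.AsyncHttpClient;
import org.asynchttpclient.Dsl;
import org.asynchttpclient.Response;
public class AsyncDeviceCall {
    public static void main(String[] args) throws Exception {
        // One non-blocking client (backed by a small Netty event-loop group)
        // can drive many concurrent requests without a thread per device.
        try (AsyncHttpClient client = Dsl.asyncHttpClient()) {
            client.prepareGet("http://device-1.example/status") // hypothetical URL
                  .execute()
                  .toCompletableFuture()
                  .thenApply(Response::getResponseBody)
                  .thenAccept(System.out::println)
                  .join();
        }
    }
}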