My question may be naive, but I have not been able to find any explanation on the web or SO that is simple enough for a beginner to understand, which is why I am posting it here.
Can anyone help me understand the difference between Spark executors vs. instances vs. cores vs. CPUs?
OR
more formally "spark.executor.instances" vs "spark.executor.cores" vs "spark.task.cpus"
Please guide me, or at least point me to a resource that explains these things in a simpler manner for a starter.
Simply put,
spark.executor.instances is the number of executors you want in your Spark application.
spark.executor.cores is the number of cores you want in each of your executors. You can think of these as individual threads in the same process (the executor) that are each capable of processing a task.
spark.task.cpus is the number of cores reserved for each individual task (usually 1, the default).
So, when your Spark app runs, you can have a total of spark.executor.instances x spark.executor.cores workers, working together, in parallel, on the tasks.
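To make the arithmetic concrete, here is a small sketch. The configuration values are hypothetical examples (the kind you would pass with --conf to spark-submit), not recommendations:

```java
public class SparkParallelism {
    public static void main(String[] args) {
        // Hypothetical example values, not recommendations:
        int executorInstances = 4; // spark.executor.instances
        int executorCores = 5;     // spark.executor.cores
        int taskCpus = 1;          // spark.task.cpus (cores reserved per task)

        // Total cores across the whole application
        int totalCores = executorInstances * executorCores;
        // How many tasks can run at the same time
        int maxConcurrentTasks = totalCores / taskCpus;

        System.out.println("total cores = " + totalCores);                  // 20
        System.out.println("max concurrent tasks = " + maxConcurrentTasks); // 20
    }
}
```

With spark.task.cpus raised to 2, the same 20 cores would run at most 10 tasks at once, which is why the three settings have to be read together.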
I understand roughly the difference between parallel computing and concurrent computing. Please correct me if I am wrong.
Parallel Computing
A system is said to be parallel if it can support two or more actions executing simultaneously. In parallel programming, efficiency is the major concern.
Concurrent Computing
A system is said to be concurrent if it can support two or more actions in progress at the same time. However, the multiple actions need not be executed simultaneously in concurrent programming. In concurrent programming, modularity, responsiveness, and maintainability are important.
I am wondering what is going to happen if I execute parallel programming code inside a multi-threaded program? e.g. using Java's parallel Stream in a multi-threaded server program.
Would the program actually be more efficient?
My initial thought is that it might not be a good idea, since a reasonably optimized multi-threaded program should already have its threads occupied. Parallelism here may add extra overhead.
The crucial difference between concurrency and parallelism is that concurrency is about dealing with a lot of things at the same time (it gives the illusion of simultaneity), or handling concurrent events, essentially hiding latency. Parallelism, on the contrary, is about doing a lot of things at the same time to increase speed.
The two have different requirements and use cases.
Parallelism is used to achieve run-time performance and efficiency. Yes, it will add some overhead to the system (CPU, RAM, etc.) because of its nature, but it is a heavily used concept in today's multi-core technology.
I am wondering what is going to happen if I execute parallel programming code inside a multi-threaded program? e.g. using Java's parallel Stream in a multi-threaded server program.
Based on my limited knowledge of the Java runtime, every program is already multithreaded: the application entry point is the main thread, which runs alongside other runtime threads (e.g. the garbage collector).
Suppose your application spawns two threads, and in one of those threads a parallelStream is created. The parallel streams API uses ForkJoinPool.commonPool, which by default has a parallelism of NUM_PROCESSORS - 1. At this point your application may have more threads than CPUs, so if your parallelStream computation is CPU-bound, then you're already oversubscribed on threads vs. CPUs.
https://stackoverflow.com/a/21172732/594589
I'm not familiar with Java, but it's interesting that parallelStream shares the same thread pool. So if your program spawned another thread and started another parallelStream, the second parallelStream would share the underlying pool threads with the first!
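You can observe this sharing directly. The sketch below prints the common pool's parallelism and collects the names of the threads a parallel stream actually runs on; typically you will see "main" plus a handful of "ForkJoinPool.commonPool-worker-*" threads, however many streams you start:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class CommonPoolDemo {
    public static void main(String[] args) {
        // The common pool backs all parallel streams by default;
        // its parallelism defaults to availableProcessors() - 1 (minimum 1)
        System.out.println("parallelism = " + ForkJoinPool.commonPool().getParallelism());

        // Collect the distinct thread names a parallel stream uses
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, 10_000)
                 .parallel()
                 .forEach(i -> names.add(Thread.currentThread().getName()));
        names.forEach(System.out::println);
    }
}
```

A second parallel stream, even one started from a different thread, would draw on those same worker threads, which is exactly the sharing described above.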
In my experiences I've found it's important to consider:
The type of workload your application is performing (CPU vs IO)
The type of concurrency primitives available (threads, processes, green threads, epoll, asyncio, etc.)
Your system resources (i.e. the number of CPUs available)
How your application's concurrency primitives map to the underlying OS resources
The number of concurrency primitives that your application has at any given time
Would the program actually be more efficient?
It completely depends, and the only sure answer is to benchmark both solutions on your target architecture/system.
In my experience, reasoning about complex concurrency beyond basic patterns becomes much of a shot in the dark. I believe that this is where the saying:
Make it work, make it right, make it fast.
-- Kent Beck
comes from. In this case, make sure that your program is concurrency-safe (make it right) and free of deadlocks. Then begin testing, benchmarking, and running experiments.
In my limited personal experience, I have found that analysis largely falls apart beyond characterizing your application's workload (CPU vs. IO) and finding a way to model it, so that you can scale out to use your system's full resources in a configurable, benchmarkable way.
Is there a way to process different Spark SQL queries (read queries with different filters and group-bys) on a static dataset, received from the front-end, in parallel rather than in FIFO order, so that users will not have to wait in a queue?
One way is to submit the queries from different threads of a thread pool, but then wouldn't concurrent threads compete for the same resources, i.e. the RDDs?
Is there a more efficient way to achieve this using spark or any other big data framework?
Currently, I'm using Spark SQL and the data is stored in Parquet format (200 GB).
I assume you mean different users submitting their own programs or spark-shell activities, and not parallelism within the same application per se.
That being so, Fair Scheduler Pools or Spark Dynamic Resource Allocation would be the best bets. Both are described here: https://spark.apache.org/docs/latest/job-scheduling.html
This area is somewhat hard to follow, because the documentation includes the following note:
... "Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs."
One can find opposing statements on Stack Overflow regarding this point. Apache Ignite is the kind of system meant here; it may well serve you as well.
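On the thread-pool idea from the question: within a single application this does work, because Spark's scheduler is thread-safe; with spark.scheduler.mode=FAIR each submitting thread can additionally be assigned a scheduler pool via SparkContext.setLocalProperty("spark.scheduler.pool", ...). The submission mechanics can be sketched in plain Java, where runQuery is a hypothetical stand-in for an actual spark.sql(...).collect() call:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentQueries {
    // Hypothetical stand-in for spark.sql(query).collect()
    static String runQuery(String q) {
        return q + " -> done";
    }

    public static void main(String[] args) throws Exception {
        // Each query is submitted from its own pool thread,
        // so none of them has to wait in a FIFO queue behind the others
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> queries = List.of("q1", "q2", "q3", "q4");
        List<Future<String>> futures = queries.stream()
                .map(q -> pool.submit(() -> runQuery(q)))
                .toList();
        for (Future<String> f : futures) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

The queries still compete for the same executors underneath; fair scheduling only changes how the competing jobs share those resources instead of running strictly one after another.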
I have a Java program that calculates the semantic similarity between two documents. The program retrieves documents from a specified file system and calculates the similarity. There are around 200,000 such docs.
I have created 10 threads for this task and assigned a chunk of data to each thread, e.g. documents 1-20000 to the first thread, 20001-40000 to the next thread, and so on.
Currently I am running the above program on an 8-core machine. It is taking a lot of time to finish the calculations.
I want to run the program on a 5 node Linux cluster where each node has 64 cores.
Are there any frameworks available in Java, like the Executor framework, that can do this task?
Is there a way to calculate the maximum number of threads that one can spawn?
Any pointers on how to resolve this or do it better would be appreciated.
Are there any frameworks available in Java, like the Executor framework, that can do this task?
I suggest you take a look at the Akka framework for writing powerful concurrent and distributed applications. Akka uses the Actor Model together with Software Transactional Memory to raise the abstraction level and provide a better platform for building correct, concurrent, and scalable applications.
Take a look at the step-by-step tutorial, which gives more information about how to build a distributed application using the Akka framework.
In general, distributed applications are built in Java using Java RMI, which internally uses Java's built-in serialization to pass objects between nodes.
Is there a way to calculate the maximum number of threads that one can spawn?
The simple rule we use is: set it to a value higher than the number of available logical cores in the system. How much higher depends on the type of operations you do. For example, if the computation involves IO, then set the number of threads to 2 × the available logical cores (not the physical cores).
Other ideas we use:
Measure the CPU utilization while increasing the number of threads one by one, and stop when CPU utilization gets close to 90-100%.
Measure the throughput, and stop at the point where throughput plateaus or starts to degrade.
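The rules of thumb above can be sketched directly; the 2× multiplier for IO-bound work is the heuristic from this answer, not a universal constant:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        // Logical cores as seen by the JVM (includes hyper-threads)
        int logicalCores = Runtime.getRuntime().availableProcessors();

        // CPU-bound work: roughly one thread per logical core
        int cpuBoundThreads = logicalCores;
        // IO-bound work: heuristic from the answer above
        int ioBoundThreads = 2 * logicalCores;

        System.out.println("CPU-bound pool size: " + cpuBoundThreads);
        System.out.println("IO-bound pool size:  " + ioBoundThreads);

        ExecutorService pool = Executors.newFixedThreadPool(ioBoundThreads);
        // ... submit the document-similarity tasks here ...
        pool.shutdown();
    }
}
```

From there, the benchmarking loop described above (raise the thread count, watch CPU utilization and throughput) refines the starting value for the specific workload.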
Java's Fork/Join framework is your friend. As the opening statement of this framework says:
The fork/join framework is an implementation of the ExecutorService
interface that helps you take advantage of multiple processors. It is
designed for work that can be broken into smaller pieces recursively.
The goal is to use all the available processing power to enhance the
performance of your application.
On how many threads you can spawn: I think there is no hard and fast rule; it depends. You can start with a number like 5 or so, and then keep increasing or decreasing it based on the results.
Also, you can analyze your existing maximum and minimum numbers of threads, contrast them with CPU utilization, etc., and proceed like this to understand how your system behaves. If your application is deployed in an application server, check its threading model and what it says about thread capacity.
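As a concrete illustration of the recursive splitting the fork/join framework is designed for, here is a minimal RecursiveTask over a range of document IDs. The score method is a hypothetical stand-in for the real similarity computation:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SimilaritySum extends RecursiveTask<Long> {
    static final int THRESHOLD = 1000; // below this, compute sequentially
    final int lo, hi;

    SimilaritySum(int lo, int hi) { this.lo = lo; this.hi = hi; }

    // Hypothetical stand-in for the real document-similarity computation
    static long score(int doc) { return doc % 7; }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += score(i);
            return sum;
        }
        // Split the range in half and process the halves in parallel
        int mid = (lo + hi) >>> 1;
        SimilaritySum left = new SimilaritySum(lo, mid);
        SimilaritySum right = new SimilaritySum(mid, hi);
        left.fork();                         // run left half asynchronously
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        long total = ForkJoinPool.commonPool().invoke(new SimilaritySum(0, 200_000));
        System.out.println(total); // prints 599994
    }
}
```

The pool's work-stealing scheduler keeps all cores busy without you assigning fixed chunks to fixed threads, which is exactly the weakness of the hand-rolled 10-thread split in the question.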
Is there a way to code a Java application so that it uses only 2 CPU cores? For example, I want to set a limit on CPU utilization. Is this possible in Java?
I believe that's something that needs to be handled at the OS level, thus can't be controlled through java code. Take a look at this post. I hope it helps.
Is it possible to force an existing Java application to use no more than x cores?
If you use a pool of threads and limit the number of threads to 2, no more than 2 cores can be used at the same time. However this is a poor way of limiting CPU usage, and may not even be possible in your scenario (it depends on how many concurrent tasks your application must run).
You can't control this directly. But if you limit the number of threads used to two it will always use only up to two cores.
But note that
a) the actual cores used might vary since the scheduler might move threads from one core to the other
b) there might be other threads started by the JVM (e.g. for garbage collection), that you might not expect from just looking at your application.
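A minimal sketch of the thread-limiting approach described above: a fixed pool of two threads means at most two of your tasks execute concurrently, so your application code keeps at most two cores busy (JVM service threads aside, per point b):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TwoCoreLimit {
    public static void main(String[] args) throws InterruptedException {
        // At most two tasks run at once, so at most two cores are busy
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 8; i++) {
            final int id = i;
            pool.submit(() -> {
                long sum = 0;
                for (int j = 0; j < 1_000_000; j++) sum += j; // CPU-bound work
                System.out.println("task " + id + " done (" + sum + ")");
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

Note this only caps your own task concurrency; it does not pin threads to particular cores or enforce an OS-level quota, which would need something like cgroups or taskset outside the JVM.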
You cannot assign threads to cores in pure Java, so you cannot do that directly.
What you can do is use JNI for such a task, or use the thread priority mechanism.
I am doing some research on language implementations on multicore platforms. Currently, I am trying to figure out a couple of things:
How does a JVM implementation map a java.lang.Thread to an OS native thread?
For instance, take OpenJDK: I do not even know which parts I should look at to read more about this. Is there any document describing how native features are implemented? Since parts of java.lang.Thread are native, I assumed that some of the more internal parts are coded natively.
Taking this to multicore: how is this mapping done for multicore? How are threads mapped to different cores so that they run simultaneously? I know there is the ExecutorService implementation that we can use to take advantage of multicore features. A consequence of the previous answers comes in here: if the OS's native threads are responsible for work distribution and thread scheduling, is it true to say that what the JVM does through a thread pool and ExecutorService is only creating threads and submitting tasks to them?
I'd be thankful for your answers and also if I am on the right track on topic.
For instance, take OpenJDK: I do not even know which parts I should look at to read more about this.
You should start by looking at the parts of the source code that are coded in C++. A C / C++ IDE might help you with the codebase exploration.
Taking this to multicore, how is this mapping done for multicore? How threads are mapped to different cores to run simultaneously?
I'm pretty sure the operating system takes care of that aspect, not the JVM: HotSpot maps each java.lang.Thread to one native OS thread, and the OS scheduler decides which core each native thread runs on at any moment.
... is it true to say that what JVM does through ThreadPool and ExecutorService is only creating threads and submitting tasks to them?
AFAIK, yes.
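That "only creating threads and submitting tasks" claim is easy to verify: a fixed pool creates its worker threads up front and then reuses them for every task, while the OS decides where those threads actually run. A small sketch:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolReuse {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Set<String> workers = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < 20; i++) {
            // record which worker thread ran each task
            pool.submit(() -> workers.add(Thread.currentThread().getName())).get();
        }
        pool.shutdown();
        // Twenty tasks, but only two worker threads ever created
        System.out.println(workers.size()); // prints 2
    }
}
```

The pool never touches scheduling or core placement; it just queues Runnables and hands them to its (native-backed) worker threads, exactly as the answer above says.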