Running a standalone Hadoop application on multiple CPU cores

My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output.
Given the current load a single multicore server will do fine for the coming year or so. We do not (yet) have the need to go for a multiserver Hadoop cluster, yet we chose to start this project "being prepared".
When I run this app on the command line (or in Eclipse or NetBeans) I have not yet been able to convince it to use more than one map and/or reduce thread at a time.
Given that the tool is very CPU intensive, this "single threadedness" is my current bottleneck.
When running it in the NetBeans profiler I do see that the app starts several threads for various purposes, but only a single map/reduce task is running at any given moment.
The input data consists of several input files so Hadoop should at least be able to run 1 thread per input file at the same time for the map phase.
What do I do to at least have 2 or even 4 active threads running (which should be possible for most of the processing time of this application)?
I'm expecting this to be something very silly that I've overlooked.
I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367
It implements the feature I was looking for in Hadoop 0.21.
It introduces the flag mapreduce.local.map.tasks.maximum to control it.
For now I've also found the solution described here in this question.
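A minimal sketch of passing that flag on the command line, assuming Hadoop 0.21 or later and a driver that parses Hadoop's generic options (for example by going through ToolRunner), so that a leading "-D key=value" pair ends up in the job configuration. The class and method names are made up for illustration; only the property name comes from MAPREDUCE-1367.

```java
import java.util.ArrayList;
import java.util.List;

final class LocalMapSlots {
    // Property name taken from MAPREDUCE-1367 (Hadoop 0.21+).
    static final String KEY = "mapreduce.local.map.tasks.maximum";

    /**
     * Returns the argument list with "-D KEY=maxMaps" placed in front,
     * because generic options have to precede the application's own arguments.
     */
    static List<String> withLocalMapLimit(String[] args, int maxMaps) {
        if (maxMaps < 1) {
            throw new IllegalArgumentException("maxMaps must be at least 1");
        }
        List<String> out = new ArrayList<>();
        out.add("-D");
        out.add(KEY + "=" + maxMaps);
        for (String a : args) {
            out.add(a);
        }
        return out;
    }
}
```

The resulting list could be handed to the driver in place of the original arguments, with maxMaps set to something like the number of cores on the machine.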

I'm not sure if I'm correct, but when you are running tasks in local mode, you can't have multiple mappers/reducers.
Anyway, to set the maximum number of running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. By default those options are set to 2, so I might be right.
Finally, if you want to be prepared for a multinode cluster, go straight to running this in fully-distributed mode, but have all servers (namenode, datanode, tasktracker, jobtracker, ...) run on a single machine.

Just for clarification...
If Hadoop runs in local mode you don't get parallel execution at the task level (unless you're running Hadoop 0.21 or later, see MAPREDUCE-1367). You can, however, submit multiple jobs at once, and those are then executed in parallel.
All the mapred.tasktracker.{map|reduce}.tasks.maximum properties apply only to Hadoop running in distributed mode!
HTH
Joahnnes

According to this thread on the hadoop.core-user email list, you'll want to change the mapred.tasktracker.tasks.maximum setting to the max number of tasks you would like your machine to handle (which would be the number of cores).
This (and other properties you may want to configure) is also documented in the main documentation on how to set up your cluster/daemons.

What you want to do is run Hadoop in "pseudo-distributed" mode: one machine, but running task trackers and name nodes as if it were a real cluster. Then it will (potentially) run several workers.
Note that if your input is small, Hadoop will decide it's not worth parallelizing. You may have to coax it by changing its default split size.
In my experience, "typical" Hadoop jobs are I/O bound, sometimes memory-bound, way before they are CPU-bound. You may find it impossible to fully utilize all the cores on one machine for this reason.

Related

Runtime.getRuntime().availableProcessors() returning 1 even though many cores available on ECS AWS

I am running a task via Docker on AWS's ECS. The task does some calculations which are CPU-bound, which I would like to run in parallel. I start a thread pool with the number of threads specified in Runtime.getRuntime().availableProcessors() which works fine locally on my PC. For some reason, on AWS ECS, this always returns 1, even though there are multiple cores available. Therefore my calculations run serially, and do not utilize the multiple cores.
For example, right now, I have a task running on a "t3.medium" instance which should have 2 cores according to the docs.
When I execute the following code:
System.out.println("Java reports " +
Runtime.getRuntime().availableProcessors() + " cores");
Then the following gets displayed on the log:
Java reports 1 cores
I do not specify the cpu parameter in ECS's task definition. I see that in the list of tasks within the ECS Management Console it has a column for "CPU" which reads 0 for my task. I also notice that in the list of instances (= VMs) it lists "CPU available" as 2048 which presumably has something to do with the fact the VM has 2 cores.
I would like my Java program to see all cores that the VM has to offer. (As would normally be the case when a Java program runs on a computer without Docker).
How do I go about doing that?

Thanks to #stdunbar in the comments for pointing me in the right direction.
EDIT: Thanks to #Imran in the comments. If you start lots of threads, they will absolutely be scheduled to multiple cores. This answer is only about getting Runtime.getRuntime().availableProcessors() to return the right value. Many "thread pools" start as many threads as that method returns: it should return the number of cores available.
There seem to be two main solutions, neither of which is ideal:
Set the cpu parameter in the task definition. For example, if you have 2 cores and want to use them both you have to set "cpu":2048 in the task's definition. This isn't very convenient for two reasons:
If you choose a bigger instance, you have to make sure to update this parameter.
If you want to have two tasks running simultaneously, both of which can sporadically use all cores for short-term activities, AWS will not schedule two tasks on a 2-core system with "cpu":2048. It says the VM is "full" from a CPU perspective. This goes against the timesharing (Unix etc.) philosophy of every task taking what it needs. Imagine running Word and Excel on a dual-core desktop PC and Windows refusing to start any other program, on the grounds that Word might need all of one core and Excel might too, so there would not be enough cores left if another program needed a whole core at the same time.
Use the -XX:ActiveProcessorCount=xx JVM option in JDK 10 onwards, as described here. This isn't convenient because:
As above, you have to change the value if you change your instance type.
I wrote a longer blog post describing my findings here: https://www.databasesandlife.com/java-docker-aws-ecs-multicore/
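Since many thread pools are sized from Runtime.getRuntime().availableProcessors(), one way to stay independent of what the container reports is to let an explicit setting take precedence. A minimal sketch, where the property name "app.threads" and the class name are hypothetical and not part of any library:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PoolSizing {
    /**
     * Returns the override when it parses as a positive integer,
     * otherwise whatever the JVM reports as available processors.
     */
    static int threadCount(String override) {
        if (override != null) {
            try {
                int n = Integer.parseInt(override.trim());
                if (n > 0) {
                    return n;
                }
            } catch (NumberFormatException ignored) {
                // Not a number: fall through to the JVM's own value.
            }
        }
        return Runtime.getRuntime().availableProcessors();
    }

    /** Builds a fixed pool; "app.threads" is a made-up property name. */
    static ExecutorService newPool() {
        return Executors.newFixedThreadPool(threadCount(System.getProperty("app.threads")));
    }
}
```

Started with -Dapp.threads=2 this would create two worker threads even where availableProcessors() returns 1. Like the two options above, the value still has to be updated when the instance type changes.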

JMeter message throughput too low

I am trying to use JMeter to test an ActiveMQ cluster. As per requirements, I need to get at least 2k messages per second as a test. The issue is that I can't get to the required number of messages.
I am trying to test it against a local queue before going to the cluster, and the results are not good. On a (quite beefy) PC with Windows 10 installed, the best I can do is a few hundred messages per second. On a Mac (MacBook Pro) with OSX 10, I can pump it up to around 1.5k.
I have tried different configurations in JMeter: varying the number of threads, size of messages, Request&Response mode vs Request only... But nothing does the trick.
When I run custom code, I can push around 10k messages into the queue in a second. Are there any particular configurations that I might be missing? I have been through the tutorials online, but I can't find anything that fixes the issue.

JMeter's default configuration is fine for test development and debugging, but when it comes to generating high load you need to remember several important points:
Don't use the GUI for test execution; you are supposed to run tests in non-GUI mode.
The default JVM heap allocation is only 512 MB; you will definitely need to raise this setting in the JMeter startup script. The same applies to stack size and garbage collector settings. See the article JVM Tuning: Heapsize, Stacksize and Garbage Collection Fundamental to learn more about JVM internals.
Don't use Listeners during the load test; they cause huge overhead in terms of resource utilization and don't add any value.
Reduce usage of Pre/Post Processors and Assertions to the absolute minimum.
See 9 Easy Solutions for a JMeter Load Test “Out of Memory” Failure for an explanation of the above points and a few more tips.
As a last resort, if you hit the hardware limits of a single load generator machine, you can always run JMeter in distributed mode and add more JMeter engines.

I found the answer after fiddling with it for hours. It turns out there is a checkbox that is unticked by default, and while it is unticked all messages are sent as persistent. When I ticked it, I got the throughput that I was looking for.

Java task distribution and collection on a grid

I have an application running on a cluster/grid where I need to run N tasks that do not have to communicate. I only need to collect the result of each task. So I have a Master distributing the tasks to some Slaves (possibly running on different hosts) and combining all the results at the end.
As the cluster is controlled by a batch system the configuration of my nodes changes for each run and I get a list of nodes that have been assigned to me for my job.
I'm looking for a library (pure Java) to help me with this. I looked at the following:
MPJ - doesn't work for me because of the way that MPJ runs when there are multiple processors available on the same machine. It uses custom class loaders and this gives me problems with a native library that I'm loading (it's loaded multiple times because the custom class loaders load the class multiple times).
Hazelcast - works in principle but it's not really made for this (I can distribute jobs with a queue and put the results back in another queue but it seems like a bit of an overkill). What I like is that it's easy to set up the group of nodes (in principle just one needs to be specified and the other nodes can just connect to it).
Simon/RMI - I guess I could let each slave register with the master and then let the master distribute jobs to each slave. Or let each slave request a queue where the jobs are queued and a queue where the results should be stored from the master.
Cajo - would in principle work but I don't want to have multicast on the grid network and there seems to be no way around this for Cajo.
RabbitMQ - I don't like to have an extra server running and it's not pure Java. Same for ZeroMQ.
Akka - Seems to be overkill as well. And a lot of configuration to set up the group of nodes.
Hadoop - Like Akka seems to be an overkill, especially the configuration to set up the group of nodes.
JPPF - Seems to be more suited for setting up a long running cluster of servers and nodes. After my application finishes I need to stop all servers and nodes. Also it seems to rely on serialization of the tasks, which is not an option for me (see further below).
So I would stick with either Hazelcast or Simon. Which one is better suited for this kind of application? Does anyone know another library (not too heavy, not too much configuration)? Any other suggestions?
Hazelcast's ExecutorService is not an option, by the way, because I'm using some JNI and so the serialization would be a pain.

I finally settled with MPJ. The problem with custom class loaders can simply be circumvented by not using the scripts included in MPJ but instead calling the java program directly with the following parameters:
java class rank mpj-config niodev [additional arguments for the application]
The rank, mpj-config and niodev arguments will be removed by the MPI_Init call.
mpj-config is a file listing number of ranks, a switching threshold for the message protocol and a list of hosts with corresponding port number and rank. niodev specifies the communication mechanism (see MPJ-Express documentation for more details). The config file could look like this:
3
131072
a6444#20000#0
a6444#20002#1
a6413#20000#2
It is important to separate the port numbers on the same host by 2, because MPJ uses the specified port plus the next one (so e.g. 20000 and 20001).
Simon and Hazelcast were also good solutions but they were a little bit slower than MPJ. Especially the initialization for both is quite a bit slower.
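Because the batch system hands out a different list of nodes for each run, the config file has to be regenerated every time. A minimal sketch of doing that, assuming the file layout described above (number of ranks, switching threshold, then one host#port#rank line per rank) and stepping the port by 2 for repeated hosts. The class and method names are made up for illustration and are not part of MPJ:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MpjConfig {
    /**
     * Builds the config text for the given hosts, one rank per list entry.
     * A host that appears several times gets ports basePort, basePort + 2, ...
     */
    static String build(List<String> hosts, int basePort, int threshold) {
        StringBuilder sb = new StringBuilder();
        sb.append(hosts.size()).append('\n');
        sb.append(threshold).append('\n');
        Map<String, Integer> nextPort = new HashMap<>();
        for (int rank = 0; rank < hosts.size(); rank++) {
            String host = hosts.get(rank);
            int port = nextPort.getOrDefault(host, basePort);
            // Reserve this port and the one after it for the current rank.
            nextPort.put(host, port + 2);
            sb.append(host).append('#').append(port).append('#').append(rank).append('\n');
        }
        return sb.toString();
    }
}
```

Called with the hosts a6444, a6444, a6413, base port 20000 and threshold 131072, this reproduces the example file shown above.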

Let me know if this solution doesn't work.
Hazelcast provides multi-node task execution through its ExecutorService.
First get the set of nodes on which you want the task to be executed, and then:
HazelcastInstance h = Hazelcast.newHazelcastInstance();
// All cluster members, or any subset that suits your requirement
Set<Member> members = h.getCluster().getMembers();
MultiTask<Long> multitask = new MultiTask<Long>(new MyCallableTask("default"), members);
// Use the same instance that was created above
ExecutorService es = h.getExecutorService();
es.execute(multitask);
// Blocks until every member has returned its result
Collection<Long> results = multitask.get();
The only thing you need to do is to have the MyCallableTask class on the classpath of all nodes.

With exactly the same workload one server shows high cpu load

I'm running hadoop and have 2 identically configured servers in the cluster. They're running the same task, same configuration, same everything, and both are totally dedicated as hadoop task nodes (workers).
The job I'm running through this cluster is highly IO bound.
On one server I see 60-100MB/sec of IO and a CPU load of 5-10, on the other server I see 40-60MB/sec of IO and a CPU load of 60-90 (and the box is almost unusable in terms of even running a simple shell).
I've run smartctl and don't get any disk warnings.
Any suggestions on what I might do next to identify the root difference between these boxes? These results have been consistent over many hours of processing.

It smells of partition misalignment on 4096-byte physical / 512-byte logical disk sectors.

Controlling maximum Java standalone running in Linux

We've developed a standalone Java program. We've configured cron on our Linux machine (RedHat ES 4) to execute this program every 10 minutes. A run may sometimes take more than 1 hour to complete, and sometimes it may complete within 5 minutes.
What I'm looking for is a way to make sure that the number of these Java processes executing at any time does not exceed, for example, 5. So, before a new process starts, if there are already 5 processes running, this process should not be started; otherwise we would indirectly start getting OutOfMemoryError problems. How do I control this? I would also like to make this 5-process limit configurable.
Other Information:
I've also configured -Xms and -Xmx heap size settings.
Is there any tool/mechanism by which we can control this?
I also heard about Java Service Wrapper. What is this all about?

You can create 5 empty files (named "1.lock", ..., "5.lock") and make the app lock one of them before it executes (or exit if all files are already locked).
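A minimal sketch of the lock-file idea using java.nio file locks, assuming all instances run on the same machine and share one lock directory. The class name, method name and directory are hypothetical; the limit is passed in so it can come from configuration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SlotLock {
    /**
     * Tries to lock one of "1.lock" .. "<maxSlots>.lock" inside dir.
     * Returns the held lock, or null when every slot is already taken.
     */
    static FileLock acquire(Path dir, int maxSlots) throws IOException {
        Files.createDirectories(dir);
        for (int i = 1; i <= maxSlots; i++) {
            FileChannel ch = FileChannel.open(dir.resolve(i + ".lock"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock lock = null;
            try {
                lock = ch.tryLock(); // null when another process holds this slot
            } catch (OverlappingFileLockException e) {
                // This slot is already locked from within the same JVM.
            } finally {
                if (lock == null) {
                    ch.close();
                }
            }
            if (lock != null) {
                return lock;
            }
        }
        return null;
    }
}
```

The program would call acquire at the top of main, exit immediately if it gets null, and otherwise keep the returned lock for as long as it runs. The operating system drops the lock when the process ends, so a crashed run does not block a slot forever.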

First, I am assuming you are using the words "thread" and "process" interchangeably. Two ideas:
Have the cron job be a script that checks the currently running processes and counts them. If the count is below the threshold, spawn a new process; otherwise exit. Here the threshold can be defined in your script.
Have the main method of your Java program check some external resource (a file, a database table, etc.) for a count of running processes. If the count is below the threshold, increment it and start the work; otherwise exit (this assumes that the simple main method alone will not be enough to cause your OOME problem). You may also need an appropriate locking mechanism on the external resource (though if your job runs every 10 minutes, this may be overkill). Here you could define the threshold in a .properties file or some other configuration file for your program.

Java Service Wrapper helps you set up a Java program as a Windows service or a *nix daemon. It doesn't really deal with the concurrency issue you are looking at; the closest thing is a config setting that disallows concurrent instances if it's a Windows service.
