SGE: Parallel Environment for a multithreaded Java code

I have written a multithreaded Java program which, when run, creates 8 threads and carries out its computation on them. I would like to submit this job to an SGE cluster, but I am not sure which parallel environment (PE) I should choose, or whether I should create one. I am new to SGE. The simple way would be to run it in serial mode, but that would be inefficient.
Regarding creating a PE: where does it need to be created? Does the SGE daemon also need to know about this PE? When I submitted a job with a random PE name, I got:
job rejected: the requested parallel environment "openmpi" does not exist

Threaded applications must get all their slots on a single node, which is why you need a parallel environment with allocation_rule set to $pe_slots. Parallel environments are configured by the SGE administrator with qconf -ap PE_name. As a user, you can only list the available PEs with qconf -spl and query the configuration of a particular PE with qconf -sp PE_name. You can walk over all PEs and see their allocation rules with the following (ba)sh script:
for pe_name in `qconf -spl`; do
echo $pe_name
qconf -sp $pe_name | grep allocation_rule
done
In any case, you should be talking to your SGE administrator, who can create a suitable PE for you if one does not already exist.
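Once a suitable PE exists and the job is submitted with something like qsub -pe <PE_name> 8 run.sh, SGE exports the granted slot count in the NSLOTS environment variable. A minimal sketch (class name and fallback behaviour are my own) of sizing the Java thread pool from it:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SgeAwarePool {
    public static void main(String[] args) {
        // NSLOTS is set by SGE for jobs submitted with -pe <PE_name> <slots>;
        // fall back to the local core count when running outside the cluster.
        String nslots = System.getenv("NSLOTS");
        int threads = (nslots != null)
                ? Integer.parseInt(nslots.trim())
                : Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // ... submit the computation tasks to 'pool' ...
        pool.shutdown();
    }
}

This way the program uses exactly the slots the scheduler granted, instead of a hard-coded 8.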

Related

Elastic APM: CPU usage when running application.jar on a local system

I am running a sample application JAR on my local system using the Elastic APM agent.
Elastic APM shows two different CPU stats (system/process).
The metrics explanation on the official site says the same thing for both stats:
https://www.elastic.co/guide/en/apm/server/current/exported-fields-system.html
Please explain: is the "system" CPU stat for my whole system, even though the agent is attached only to application.jar via the java command below? If so, how can I check in Elastic APM what else on my system is consuming CPU, since only the application is running during the load test?
java -javaagent:<agent.jar> -jar <app.jar>
[screenshot of the CPU usage metrics]
The metrics shown in Kibana are sent by the APM agent, which, as you said, has limited access to your environment. It essentially reports whatever can be collected by the JVM running your JAR.
If you want further visibility into the CPU details of your local environment, you should augment your setup with Elastic Metricbeat, which ships OS-level details about your machine and sees beyond what the JVM can see.
In the presentation below I show how to configure logs, metrics, and APM together.
https://www.youtube.com/watch?v=aXbg9pZCjpk
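As a rough pointer (standard Metricbeat distribution assumed; paths and configuration will vary on your machine), the OS-level view comes from Metricbeat's system module:

metricbeat modules enable system
metricbeat -e -c metricbeat.yml

With that running alongside the APM agent, Kibana can show both the JVM-level process stats and host-level CPU consumers.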

How to uniquely identify a java process after multiple restart

I have a number of Java processes running on my machine. I need to track how many times each process gets restarted.
For example, consider two Java processes:
Process 1 - restarted 5 times.
Process 2 - restarted 2 times.
I'm able to get the PID and the java command of the running processes, but I cannot tell the processes apart once they are restarted, because the PID changes after a restart. I also can't rely on the java command, because two instances of the same application have the same command.
So what other ways are there to track Java process restarts?
You want your processes to keep the same identity after a restart. Ideally, you would have a parameter, system property or environment variable telling the process its identity.
As you state in the question, this identity cannot be passed on the command line. Thus, the process has to "find" its identity by acquiring an exclusive resource.
This resource could be a shared system implementing locks, but that is probably too complex.
Network sockets are exclusive resources, so you could make each process open a socket for the sole purpose of acquiring an identity.
You can use the code from https://stackoverflow.com/a/116113/2242270 to open a socket within a range; the identity of the process is then the port it managed to open.
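A minimal sketch of that approach, assuming a hypothetical identity range of 9100-9110 (pick any range that is free on your machine):

import java.io.IOException;
import java.net.ServerSocket;

public class ProcessIdentity {
    // Try each port in the range; the first one we can bind becomes this process's identity.
    static ServerSocket acquireIdentity(int firstPort, int lastPort) throws IOException {
        for (int port = firstPort; port <= lastPort; port++) {
            try {
                return new ServerSocket(port); // keep it open for the whole process lifetime
            } catch (IOException portTaken) {
                // already owned by another instance, try the next port
            }
        }
        throw new IOException("no free identity port between " + firstPort + " and " + lastPort);
    }

    public static void main(String[] args) throws IOException {
        ServerSocket identity = acquireIdentity(9100, 9110);
        System.out.println("process identity = port " + identity.getLocalPort());
        // ... run the actual application; do not close 'identity' ...
    }
}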

Recurring "OutOfMemoryError: unable to create new native thread" with JBOSS EJB CLIENT

This error happens frequently when calling an EJB service on JBOSS EAP 6.4, and it always occurs in EJBClientContext registerEJBReceiver / unregisterEJBReceiver. Both methods submit a runnable into a CachedThreadPool (Executors.newCachedThreadPool) named ejbClientContextTasksExecutorService.
The code can be viewed inside the class EJBClientContext:
https://raw.githubusercontent.com/wildfly/jboss-ejb-client/87aef56ab787f57a9508c6e2b0f876066ae464fe/src/main/java/org/jboss/ejb/client/EJBClientContext.java
I have a JBOSS client application, a batch job that creates a fixed set of 20 threads (with Executors.newCachedThreadPool), but each task invokes an EJB remote object that in turn uses the CachedThreadPool of EJBClientContext.
The number of threads running inside the CachedThreadPool of EJBClientContext is unknown, but I checked some OS limits and they appear to be more than enough:
nproc > 100000
ulimit -u > 100000
kernel.pid_max > 100000
/proc/sys/kernel/threads-max > 150000
I have been watching thread consumption on the server during the whole duration of the batch with the following command:
ps -A -o pid,nlwp,cmd
The number of threads per process remains quite low (maximum 100 threads per process, for 2 or 3 processes at the same time).
Try to profile your application. If you are using the Oracle JDK, you can create a flight recording and analyse it with JMC (Java Mission Control). That gives you a good analysis over time, up to the point where the error occurs.
I also found a good article on this; try to compare the results of your analysis with it.
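For reference, on an Oracle JDK 7u40+/8 a flight recording can also be started straight from the command line; the duration and file name below are only placeholders:

java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=duration=120s,filename=recording.jfr -jar <app.jar>

Open the resulting recording.jfr in JMC and look at the thread-related views to see which pool is actually creating the native threads.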

How to achieve Robot Framework parallel test execution on two different machines?

I'm automating a web application using Robot Framework with Selenium2Library.
I'm looking for parallel test execution of two different test suites on two different machines at the same time.
I tried pabot for parallel execution. Running 3 instances in parallel on a single machine works, but I want to run them on different machines. For that I tried the following:
First I start the hub:
java -jar <selenium.jar> -role hub
(optional port, e.g.: -port 4444)
Then I start the nodes:
java -jar <selenium.jar> -role webDriver (for Selenium2Library) -hub http://<selenium hub ip>:4444/grid/register
(optional parameter remoteHost, e.g.: -remoteHost http://127.0.0.1:5555)
Nodes can be run separately and addressed via the additional parameter -remoteHost. This host can then be used in the Selenium keyword Open Browser:
Open Browser | url | browser=ff | alias=None | remote_url=False
Example:
${REMOTE_DRIVER}= Set Variable 127.0.0.1:5555/wd/hub
Open Browser www.google.com ff None ${REMOTE_DRIVER}
But with this, the suites run on the different machines one after another and not in parallel.
Is there any way to achieve that?
To run suites in parallel, two components are needed:
Selenium Grid, or other centralized Grid infra (SauceLabs, Zalenium, Aerokube Selenoid).
Parallel Executor (Pabot)
Natively, Robot Framework only supports running one suite at a time. By extension, this means that any Robot script that uses Selenium will only have one suite running at a time. In order to parallelize, you need to run multiple Robot Framework instances in parallel.
The Pabot project is a separate application that runs a separate Robot Framework instance per suite (file). At the end it merges all the separate logs into a single log file. It has a few more features, but that's the core.
From your description I take it that you have already set up a Grid where multiple nodes have joined successfully. If that is the case, then using the Grid server (hub) URL when connecting to your browser should suffice to have the nodes utilized.
Do make sure that the number of parallel Pabot processes (the --processes parameter) does not exceed the number of available Selenium nodes; an example follows.
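For instance (suite path, hub address, variable name, and process count are placeholders), point the tests at the hub rather than at an individual node and cap the number of processes at the number of registered nodes:

pabot --processes 2 --variable GRID_URL:http://<selenium hub ip>:4444/wd/hub tests/

and inside the suites open the browser against the hub:

Open Browser    ${URL}    browser=ff    remote_url=${GRID_URL}

The hub then distributes each session to whichever node is free, which is what gives you true parallel execution across machines.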

Running a Job on Spark 0.9.0 throws error

I have an Apache Spark 0.9.0 cluster installed, on which I am trying to deploy code that reads a file from HDFS. The code throws a warning and eventually the job fails. Here is the code:
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Running this code fails with the warning:
 * "Initial job has not accepted any resources; check your cluster UI to ensure that
 *  workers are registered and have sufficient memory"
 */
object Main extends App {
  val sconf = new SparkConf()
    .setMaster("spark://labscs1:7077")
    .setAppName("spark scala")
  val sctx = new SparkContext(sconf)
  sctx.parallelize(1 to 100).count
}
Below is the warning message:
Initial job has not accepted any resources; check your cluster UI to
ensure that workers are registered and have sufficient memory
How do I get rid of this? Or am I missing some configuration?
You get this when either the number of cores or the amount of RAM (per node) you request via spark.cores.max and spark.executor.memory respectively exceeds what is available. So even if no one else is using the cluster, if you ask for, say, 100 GB of RAM per node but your nodes can only provide 90 GB, you will get this error message.
To be fair, the message is vague in this situation; it would be more helpful if it said you are exceeding the maximum.
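For example (the numbers are only placeholders; keep them at or below what the master UI reports for your workers), the request can be capped explicitly in the question's own SparkConf:

val sconf = new SparkConf()
  .setMaster("spark://labscs1:7077")
  .setAppName("spark scala")
  .set("spark.cores.max", "8")          // no more cores than the cluster actually offers
  .set("spark.executor.memory", "2g")   // no more memory per executor than a worker has free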
Looks like Spark master can't assign any workers for this task. Either the workers aren't started or they are all busy.
Check the Spark UI on the master node (the port is specified by SPARK_MASTER_WEBUI_PORT in spark-env.sh, 8080 by default).
For the cluster to function properly:
There must be some workers with state "Alive"
There must be some cores available (for example, if all cores are busy with the frozen task, the cluster won't accept new tasks)
There must be sufficient memory available
Also make sure your spark workers can communicate both ways with the driver. Check for firewalls, etc.
I had this exact issue. I had a simple 1-node Spark cluster and was getting this error when trying to run my Spark app.
I ran through some of the suggestions above, and it was when I tried to run the Spark shell against the cluster and could not see it in the UI that I became suspicious my cluster was not working correctly.
In my hosts file I had an entry, let's say SparkNode, that referenced the correct IP Address.
I had inadvertently put the wrong IP Address in the conf/spark-env.sh file against the SPARK_MASTER_IP variable. I changed this to SparkNode and I also changed SPARK_LOCAL_IP to point to SparkNode.
To test this I opened up the UI using SparkNode:7077 in the browser and I could see an instance of Spark running.
I then used Wildfire's suggestion of running the Spark shell, as follows:
MASTER=spark://SparkNode:7077 bin/spark-shell
Going back to the UI I could now see the Spark shell application running, which I couldn't before.
So I exited the Spark shell and ran my app using Spark Submit and it now works correctly.
It is definitely worth checking out all of your IP and host entries, this was the root cause of my problem.
You need to specify the right SPARK_HOME and your driver program's IP address, in case Spark is not able to locate your Netty jar server. Be aware that your Spark master should listen on the correct IP address, the one you intend to use. This can be done by setting SPARK_MASTER_IP=yourIP in the spark-env.sh file.
val conf = new SparkConf()
.setAppName("test")
.setMaster("spark://yourSparkMaster:7077")
.setSparkHome("YourSparkHomeDir")
.set("spark.driver.host", "YourIPAddr")
Check for errors regarding hostname, IP address, and loopback. Make sure to set SPARK_LOCAL_IP and SPARK_MASTER_IP, for example as shown below.
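A minimal spark-env.sh sketch (the address is a placeholder for the machine's real, routable IP):

export SPARK_MASTER_IP=192.168.1.10
export SPARK_LOCAL_IP=192.168.1.10

Using the routable address instead of a loopback or wrong hosts-file entry lets the workers and the driver reach each other in both directions.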
I had a similar "Initial job has not accepted any resources" issue and fixed it by specifying the correct Spark download URL in spark-env.sh, or alternatively by installing Spark on all slaves.
export SPARK_EXECUTOR_URI=http://mirror.fibergrid.in/apache/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
