I set up my YARN cluster and my Spark cluster on the same machines, but now I need to run a Spark job on YARN using client mode.
Here is my sample config for my job:
SparkConf sparkConf = new SparkConf(true).setAppName("SparkQueryApp")
    .setMaster("yarn-client") // "yarn-cluster" or "yarn-client"
    .set("es.nodes", "10.0.0.207")
    .set("es.nodes.discovery", "false")
    .set("es.cluster", "wp-es-reporting-prod")
    .set("es.scroll.size", "5000")
    .setJars(JavaSparkContext.jarOfClass(Demo.class))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.default.parallelism", String.valueOf(cpus * 2))
    .set("spark.executor.memory", "10g")
    .set("spark.num.executors", "40")
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "10")
    .set("spark.dynamicAllocation.maxExecutors", "50")
    .set("spark.logConf", "true");
This doesn't seem to work when I try to run my Spark job with
java -jar spark-test-job.jar
I get this exception:
405472 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to
server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1 SECONDS)
406473 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to
server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
Any help?
I am using the release package of JHipster Registry v7.1.0.
At startup it tries to connect to itself and throws this exception:
2021-10-09 22:17:45.578 WARN 1768 --- [ main] c.n.d.s.t.d.RetryableEurekaHttpClient : Request execution failed with message: java.net.ConnectException: Connection refused: connect
2021-10-09 22:17:45.578 INFO 1768 --- [ main] com.netflix.discovery.DiscoveryClient : DiscoveryClient_JHIPSTER-REGISTRY/jhipsterRegistry:73fc39b9052cc9c8bdea1a67d1e92d17 - was unable to refresh its cache! This periodic background refresh will be retried in 10 seconds. status = Cannot execute request on any known server stacktrace = com.netflix.discovery.shared.transport.TransportException: Cannot execute request on any known server
at com.netflix.discovery.shared.transport.decorator.RetryableEurekaHttpClient.execute(RetryableEurekaHttpClient.java:112)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.getApplications(EurekaHttpClientDecorator.java:134)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator$6.execute(EurekaHttpClientDecorator.java:137)
at com.netflix.discovery.shared.transport.decorator.SessionedEurekaHttpClient.execute(SessionedEurekaHttpClient.java:77)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.getApplications(EurekaHttpClientDecorator.java:134)
at com.netflix.discovery.DiscoveryClient.getAndStoreFullRegistry(DiscoveryClient.java:1101)
at com.netflix.discovery.DiscoveryClient.fetchRegistry(DiscoveryClient.java:1014)
at com.netflix.discovery.DiscoveryClient.<init>(DiscoveryClient.java:441)
at com.netflix.discovery.DiscoveryClient.<init>(DiscoveryClient.java:283)
at com.netflix.discovery.DiscoveryClient.<init>(DiscoveryClient.java:279)
at org.springframework.cloud.netflix.eureka.CloudEurekaClient.<init>(CloudEurekaClient.java:66)
Arguments that I use to start it:
--spring.security.user.password=admin
--server.port=8761
--spring.profiles.active=prod
--spring.cloud.config.server.composite.0.type=native
--spring.cloud.config.server.composite.0.search-locations=file:./configuration
and the application.yaml in the configuration folder looks like this:
configserver:
  name: JHipster Registry config server
  status: Connected to the JHipster Registry config server
jhipster:
  security:
    authentication:
      jwt:
        secret: ZXlKaGJHY2lPaUpJVXpJMU5pSXNJblI1Y0NJNklrcFhWQ0o5LmV5SnpkV0lpT2lJeE1qTTBOVFkzT0Rrd0lpd2libUZ0WlNJNklrcHZhRzRnUkc5bElpd2lhV0YwSWpveE5URTJNak01TURJeWZRLmNUaElJb0R2d2R1ZVFCNDY4SzV4RGM1NjMzc2VFRm9xd3hqRl94U0p5UVE
Apache Spark (v1.6.1) is started as a service on an Ubuntu machine (10.10.0.102) using ./start-all.sh.
Now I need to submit jobs to this server remotely using the Java API.
Following is the Java client code, run from a different machine (10.10.0.95):
String mySqlConnectionUrl = "jdbc:mysql://localhost:3306/demo?user=sec&password=sec";
String jars[] = new String[] {"/home/.m2/repository/com/databricks/spark-csv_2.10/1.4.0/spark-csv_2.10-1.4.0.jar",
"/home/.m2/repository/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar",
"/home/.m2/repository/mysql/mysql-connector-java/6.0.2/mysql-connector-java-6.0.2.jar"};
SparkConf sparkConf = new SparkConf()
.setAppName("sparkCSVWriter")
.setMaster("spark://10.10.0.102:7077")
.setJars(jars);
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(javaSparkContext);
Map<String, String> options = new HashMap<>();
options.put("driver", "com.mysql.jdbc.Driver");
options.put("url", mySqlConnectionUrl);
options.put("dbtable", "(select p.FIRST_NAME from person p) as firstName");
DataFrame dataFrame = sqlContext.read().format("jdbc").options(options).load();
dataFrame.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", "|")
.option("quote", "\"")
.option("quoteMode", QuoteMode.NON_NUMERIC.toString())
.option("escape", "\\")
.save("persons.csv");
Configuration hadoopConfiguration = javaSparkContext.hadoopConfiguration();
FileSystem hdfs = FileSystem.get(hadoopConfiguration);
FileUtil.copyMerge(hdfs, new Path("persons.csv"), hdfs, new Path("/home/persons1.csv"), true, hadoopConfiguration, "");
As per the code, I need to convert RDBMS data to CSV/JSON using Spark. When I run this client application, it is able to connect to the remote Spark server, but the console continuously prints the following WARN message:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
And on the server side, in the Spark UI under running applications > executor summary > stderr log, I see the following error:
Exception in thread "main" java.io.IOException: Failed to connect to /192.168.56.1:53112
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /192.168.56.1:53112
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
But no IP address 192.168.56.1 is configured on either machine. Is there some configuration missing?
Actually, my client machine (10.10.0.95) is a Windows machine. When I tried to submit the Spark job from another Ubuntu machine (10.10.0.155), the same Java client code ran successfully.
Debugging in the Windows client environment, I see the following log when I submit the Spark job:
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.56.1:61552]
INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 61552.
INFO MemoryStore: MemoryStore started with capacity 2.4 GB
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4044.
INFO SparkUI: Started SparkUI at http://192.168.56.1:4044
As per log line 2, it registers the client with 192.168.56.1.
Whereas, on the Ubuntu client:
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.10.0.155:42786]
INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 42786.
INFO MemoryStore: MemoryStore started with capacity 511.1 MB
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4040.
INFO SparkUI: Started SparkUI at http://10.10.0.155:4040
As per log line 2, it registers the client with 10.10.0.155, the same as the actual IP address.
If anybody finds out what the problem is with the Windows client, please let the community know.
[UPDATE]
I am running this whole environment in VirtualBox. The Windows machine is my host and Ubuntu is the guest, and Spark is installed on the Ubuntu machine. In this environment, VirtualBox installs an Ethernet adapter, VirtualBox Host-Only Network, with IPv4 address 192.168.56.1, and Spark registers this IP as the client IP instead of the actual IP address 10.10.0.95.
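A minimal workaround sketch, assuming the only problem is the address the driver advertises: Spark lets you pin the driver's advertised address with spark.driver.host (or the SPARK_LOCAL_IP environment variable), so the client below forces the routable LAN address instead of the host-only adapter's 192.168.56.1.
import org.apache.spark.SparkConf;

// Sketch only: the same client as above, but pinning the driver's advertised address.
SparkConf sparkConf = new SparkConf()
        .setAppName("sparkCSVWriter")
        .setMaster("spark://10.10.0.102:7077")
        // Advertise the Windows host's LAN address (10.10.0.95 from the question)
        // instead of the VirtualBox host-only address 192.168.56.1, so executors
        // can connect back to the driver.
        .set("spark.driver.host", "10.10.0.95")
        .setJars(jars); // same jars array as in the snippet above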
I have set up a cluster on 12 machines, and the Spark workers on the slave machines disassociate from the master every day. They appear to work well for a period of time during the day, but then the slaves all disassociate and are shut down.
The log of the worker would look like below:
16/03/07 12:45:34.828 INFO Worker: Retrying connection to master (attempt # 1)
16/03/07 12:45:34.830 INFO Worker: Connecting to master host1:7077...
16/03/07 12:45:34.826 INFO Worker: Retrying connection to master (attempt # 2)
16/03/07 12:45:45.830 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1c5651e9 rejected from java.util.concurrent.ThreadPoolExecutor@671ba687[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 2]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
...
16/03/07 12:45:45.853 Info ExecutorRunner: Killing process!
16/03/07 12:45:45.855 INFO ShutdownHookManager: Shutdown hook called
The log of the master would look like below:
16/03/07 12:45:45.878 INFO Master: 10.126.217.11:51502 got disassociated, removing it.
16/03/07 12:45:45.878 INFO Master: Removing worker worker-20160303035822-10.126.217.11-51502 on 10.126.217.11:51502
Information for the machines:
40 cores and 256GB memory per machine
spark version: 1.5.1
java version: 1.8.0_45
The spark application runs on this cluster with configuration below:
spark.cores.max=360
spark.executor.memory=32g
Is it a memory issue on the slave or on the master machine?
Or is it a network issue between the slaves and the master machine?
Or some other problem?
Please advise.
Thanks.
I'm trying to run the following Spark example under Hadoop 2.6, but I get the following error:
INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032, and the Client enters a loop trying to connect. I'm running a cluster of two machines, one master and one slave.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
This is the error I get:
15/12/06 13:38:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/06 13:38:29 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/06 13:38:30 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/12/06 13:38:31 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/12/06 13:38:32 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/12/06 13:38:33 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
15/12/06 13:38:34 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
jps
hduser@master:/usr/local/spark$ jps
4930 ResourceManager
4781 SecondaryNameNode
5776 Jps
4608 DataNode
5058 NodeManager
4245 Worker
4045 Master
My /etc/hosts:
192.168.0.1 master
192.168.0.2 slave
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
This error mainly occurs when the hostname is not configured correctly. Please check whether the hostname is configured correctly and matches what you have specified for the ResourceManager.
I had faced the same problem and solved it.
Do the following steps:
Start YARN with the command: start-yarn.sh
Check the ResourceManager with the command: jps
Add the following property to the configuration (yarn-site.xml):
<property>
<name>yarn.resourcemanager.address</name>
<value>127.0.0.1:8032</value>
</property>
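A minimal sketch to double-check that the client now picks up the right address, assuming the Hadoop/YARN client libraries and yarn-site.xml are on the classpath (the class name RmCheck is just for illustration); it prints the resolved ResourceManager address and fails fast if the ResourceManager cannot be reached:
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmCheck {
    public static void main(String[] args) throws Exception {
        // Loads yarn-default.xml and yarn-site.xml from the classpath
        YarnConfiguration conf = new YarnConfiguration();
        System.out.println("ResourceManager address: " + conf.get(YarnConfiguration.RM_ADDRESS));

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        // Throws an exception if the ResourceManager cannot be contacted
        System.out.println("Nodes in cluster: " + yarnClient.getNodeReports().size());
        yarnClient.stop();
    }
}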
I had also encountered the same issue, where I was not able to submit the Spark job with spark-submit.
The issue was due to the HADOOP_CONF_DIR path missing while launching the Spark job. So, whenever you submit a job, set HADOOP_CONF_DIR to the appropriate Hadoop conf directory.
For example: export HADOOP_CONF_DIR=/etc/hadoop/conf
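When the job is launched with a plain java -jar instead of spark-submit (as in the original question), HADOOP_CONF_DIR may not be picked up unless the conf directory is also on the classpath. As a fallback, a hedged sketch: Spark copies any spark.hadoop.* property into the underlying Hadoop Configuration, so the ResourceManager location can be set directly on the SparkConf (the host name master below is only a placeholder for your actual ResourceManager host):
import org.apache.spark.SparkConf;

// Sketch: point the YARN client at the ResourceManager from the SparkConf itself.
// The "spark.hadoop." prefix copies the property into the Hadoop Configuration.
SparkConf sparkConf = new SparkConf()
        .setAppName("SparkQueryApp")
        .setMaster("yarn-client")
        .set("spark.hadoop.yarn.resourcemanager.hostname", "master")   // placeholder host
        .set("spark.hadoop.yarn.resourcemanager.address", "master:8032");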
You need to make sure that yarn-site.xml is on the classpath, and also make sure that the relevant properties are marked with a true element.
A similar export HADOOP_CONF_DIR=/etc/hadoop/conf was a good idea in my case for Flink on YARN, when I run ./bin/yarn-session.sh -n 2 -tm 2000.
As you can see here, yarn.resourcemanager.address is calculated based on yarn.resourcemanager.hostname, whose default value is 0.0.0.0. So you should configure it correctly.
From the base of the Hadoop installation, edit the etc/hadoop/yarn-site.xml file and add this property.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
Executing start-yarn.sh again will put your new settings into effect.
I had the same problem. In my case the times were not the same across machines, since my ResourceManager is not on the master machine. Just one second of difference can cause YARN connection problems, and a few more seconds of difference can leave your NameNode and DataNode unable to start. Use ntpd to configure time synchronization and make sure the times are exactly the same.
I've been trying to run examples from HBase: The Definitive Guide, and I keep encountering this error and am not able to get past it. I'm running in standalone mode, if that helps.
Exception in thread "main" org.apache.hadoop.hbase.MasterNotRunningException: �
17136@ubuntulocalhost,32992,1373877731444
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:615)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)
at util.HBaseHelper.<init>(HBaseHelper.java:29)
at util.HBaseHelper.getHelper(HBaseHelper.java:33)
at client.PutExample.main(PutExample.java:22)
But my HMaster process is running:
hduser@ubuntu:/home/ubuntu/hbase-book/ch03$ jps
17602 Jps
8709 NameNode
8929 DataNode
9472 TaskTracker
9252 JobTracker
9172 SecondaryNameNode
17136 HMaster
This is my hbase-site.xml file:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase/hbase-data/</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/hbase/zookeeper-data/</value>
</property>
</configuration>
This is my /etc/hosts file:
127.0.0.1 localhost
127.0.1.1 ubuntu
127.0.0.1 ubuntu.ubuntu-domain ubuntu
Specifically, I'm trying to run the Chapter 3 examples, and I just don't understand why my setup is not working.
Any idea where I'm going wrong?
Edit: Here are the logs:
2013-07-15 03:56:32,663 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:60119
2013-07-15 03:56:32,672 WARN org.apache.zookeeper.server.ZooKeeperServer: Connection request from old client /127.0.0.1:60119; will be dropped if server is in r-o mode
2013-07-15 03:56:32,672 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:60119
2013-07-15 03:56:32,674 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x13fe17e7f1d0006 with negotiated timeout 40000 for client /127.0.0.1:60119
2013-07-15 03:57:11,653 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=1.17 MB, free=247.24 MB, max=248.41 MB, blocks=2, accesses=68, hits=55, hitRatio=80.88%, , cachingAccesses=61, cachingHits=53, cachingHitsRatio=86.88%, , evictions=0, evicted=6, evictedPerRun=Infinity
2013-07-15 03:57:14,333 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x13fe17e7f1d0006, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:724)
2013-07-15 03:57:14,334 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:60119 which had sessionid 0x13fe17e7f1d0006
2013-07-15 03:57:24,551 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=1 regions=1 average=1.0 mostloaded=1 leastloaded=1
2013-07-15 03:57:24,568 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row= for max=2147483647 rows using org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@189ddf
2013-07-15 03:57:24,578 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 1 catalog row(s) and gc'd 0 unreferenced parent region(s)
It has nothing to do with standalone or distributed mode. Make sure your setup is working fine; I can see that the RegionServer and ZooKeeper are not running. Comment out the line 127.0.1.1 ubuntu in your /etc/hosts file and restart HBase. You might have to kill it first.
P.S.: Since you already have Hadoop configured and running fine, you can run HBase in a pseudo-distributed setup.
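After restarting, a minimal sketch (assuming the HBase client API from that era, with hbase-site.xml on the classpath; the class name MasterCheck is just for illustration) to confirm the client can actually reach the master before rerunning the book examples:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MasterCheck {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        // Throws MasterNotRunningException or ZooKeeperConnectionException if HBase is unreachable
        HBaseAdmin.checkHBaseAvailable(conf);
        System.out.println("HBase master is reachable");
    }
}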