I have a client application from which I need to remotely execute queries on Spark using Spark SQL. I am able to do this from spark-shell, but how can I execute them remotely from my Scala-based client application?
I have tried the following code:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:port").setAppName("Query Fire")
  .set("spark.shuffle.blockTransferService", "nio")
  .set("spark.hadoop.validateOutputSpecs", "true")
  .set("spark.local.dir", "/tmp/spark-temp")
  .set("spark.driver.memory", "4G").set("spark.executor.memory", "4G")
val spark = SparkContext.getOrCreate(conf)
I tried the default standalone port 7077, but it is not open. I have a Cloudera-based Spark installation, which apparently is not running Spark in standalone mode.
The error I get when running the code against the YARN ResourceManager port 8042 is the following:
16/09/16 20:14:36 WARN TransportChannelHandler: Exception in connection from /192.168.0.171:8042
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
Is there any way to get around this and call Spark SQL remotely via a JDBC client, like we can do for Hive queries?
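One common route, if the cluster runs (or you can start) the Spark Thrift Server, is to treat it like HiveServer2 and connect over plain JDBC with the Hive driver; Spark executes the SQL on the cluster and streams the results back. A minimal sketch in Java (the same calls work unchanged from Scala); thrift-host, port 10000, the username, and the table name are placeholders you would need to adapt:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkSqlOverJdbc {
    public static void main(String[] args) throws Exception {
        // The Spark Thrift Server speaks the HiveServer2 protocol, so the Hive JDBC driver is used
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://thrift-host:10000/default"; // placeholder host/port
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT count(*) FROM some_table")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}

This avoids opening the standalone master port entirely; the only port the client needs to reach is the Thrift Server's.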
I am trying to connect Azkaban (a job scheduler for Hadoop) to my local MySQL. The Azkaban configuration file looks like this:
database.type=mysql
mysql.port=3306
mysql.host=localhost
mysql.database=azkaban
#Changed by Prakhar for azkaban , Azkaban
mysql.user=root
mysql.password= [ Password of mysql ]
My MySQL has a database named "azkaban" and I am able to log in to MySQL using the command:
./mysql -u root -p
I have also verified that MySQL is listening on port 3306.
Still, I am unable to connect to MySQL. The Azkaban logs look like this:
2020/04/11 22:38:05.584 +0530 ERROR [MySQLDataSource] [Azkaban] Failed to find write-enabled DB connection. Wait 15 seconds and retry. No.Attempt = 1
java.sql.SQLException: Cannot create PoolableConnectionFactory (Could not create connection to database server.)
at org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:2294)
at org.apache.commons.dbcp2.BasicDataSource.createDataSource(BasicDataSource.java:2039)
at azkaban.db.MySQLDataSource.getConnection(MySQLDataSource.java:76)
at org.apache.commons.dbutils.AbstractQueryRunner.prepareConnection(AbstractQueryRunner.java:175)
at org.apache.commons.dbutils.QueryRunner.query(QueryRunner.java:286)
at azkaban.db.DatabaseOperator.query(DatabaseOperator.java:68)
at azkaban.executor.ExecutorDao.fetchActiveExecutors(ExecutorDao.java:53)
at azkaban.executor.JdbcExecutorLoader.fetchActiveExecutors(JdbcExecutorLoader.java:266)
at azkaban.executor.ExecutorManager.setupExecutors(ExecutorManager.java:223)
at azkaban.executor.ExecutorManager.<init>(ExecutorManager.java:131)
at azkaban.executor.ExecutorManager$$FastClassByGuice$$e1c1dfed.newInstance(<generated>)
at com.google.inject.internal.DefaultConstructionProxyFactory$FastClassProxy.newInstance(DefaultConstructionProxyFactory.java:89)
at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:111)
at
A few pointers that might help you resolve the issue:
Please check the version of MySQL installed against the MySQL client mentioned in the Gradle file used to build Azkaban.
Preferably use a MySQL 5.x version.
Make sure to install a compatible client version.
Also, create the MySQL tables in the database before starting the Azkaban web server and executor; a quick way to verify the connection settings outside Azkaban is sketched below.
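Before digging further into Azkaban itself, it can help to confirm that a plain JDBC client can reach MySQL with exactly the settings from the Azkaban config above. A rough sketch (the driver auto-registers from the Connector/J jar on the classpath; the useSSL parameter just silences SSL warnings on newer servers):

import java.sql.Connection;
import java.sql.DriverManager;

public class MySqlConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Same host/port/database/user as in the Azkaban properties file
        String url = "jdbc:mysql://localhost:3306/azkaban?useSSL=false";
        String user = "root";
        String password = args.length > 0 ? args[0] : ""; // pass the real password as an argument

        try (Connection conn = DriverManager.getConnection(url, user, password)) {
            System.out.println("Connected to: " + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}

If this small check fails with the same "Could not create connection to database server" message, the issue is the MySQL server/driver combination (version mismatch, bind-address, or credentials) rather than Azkaban's configuration.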
I want to connect to Google Cloud Bigtable, which is running in Docker:
docker run --rm -it -p 8086:8086 -v ~/.config/:/root/.config \
bigtruedata/gcloud-bigtable-emulator
It starts without any problems:
[bigtable] Cloud Bigtable emulator running on 127.0.0.1:8086
~/.config contains my default credentials, which I configured this way:
gcloud auth application-default login
I used the Java code from the official HelloWorld sample.
Also, I changed the connection configuration like this:
Configuration conf = BigtableConfiguration.configure("projectId", "instanceId");
conf.set(BigtableOptionsFactory.BIGTABLE_HOST_KEY, "127.0.0.1");
conf.set(BigtableOptionsFactory.BIGTABLE_PORT_KEY, "8086");
conf.set(BigtableOptionsFactory.BIGTABLE_USE_PLAINTEXT_NEGOTIATION, "true");
try (Connection connection = BigtableConfiguration.connect(conf)) {
...
And I set the BIGTABLE_EMULATOR_HOST=127.0.0.1:8086 environment variable in the run configuration for my app in IntelliJ IDEA.
But when I run my Java app, it gets stuck on admin.createTable(descriptor); and shows this log:
...
16:42:44.697 [grpc-default-executor-0] DEBUG com.google.bigtable.repackaged.io.grpc.netty.shaded.io.netty.util.Recycler - -Dio.netty.recycler.ratio: 8
After some time it prints a log message about BigtableClientMetrics and then throws an exception:
java.net.NoRouteToHostException: No route to host
I get the same problem when trying to run Google Cloud Bigtable with my own Dockerfile.
When I run Google Cloud Bigtable with this command:
gcloud beta emulators bigtable start
my app completes successfully.
So, how can I solve this problem?
UPDATE:
Now I have this exception:
io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason
and before that, another exception is thrown:
java.io.IOException: Connection reset by peer
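"No route to host" and "Connection reset by peer" usually point at the network path between the app and the published container port rather than at the Bigtable client configuration itself. A quick way to separate the two is a bare TCP check from the same machine the app runs on — a minimal diagnostic sketch, using the host/port from the -p 8086:8086 mapping above:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class EmulatorPortCheck {
    public static void main(String[] args) {
        InetSocketAddress addr = new InetSocketAddress("127.0.0.1", 8086); // published emulator port
        try (Socket socket = new Socket()) {
            socket.connect(addr, 3000); // 3-second timeout
            System.out.println("TCP connection to the emulator port succeeded");
        } catch (IOException e) {
            // NoRouteToHostException / ConnectException here means the problem is
            // Docker networking or a firewall, not the Bigtable client settings
            System.out.println("Cannot reach the emulator: " + e);
        }
    }
}

If the raw socket connects but the gRPC calls still fail, check whether the emulator inside the container listens on 0.0.0.0 rather than only on 127.0.0.1; publishing the port with -p is not enough if the process is bound to the container's loopback interface.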
I have created an R package that uses the sparklyr capabilities inside a dummy hello function. The package does a very simple thing: it connects to a Spark cluster, prints the Spark version, and disconnects. The package cleans and builds successfully and runs fine from R and RStudio.
# Connect to Spark cluster
spark_conn <- sparklyr::spark_connect(master = "spark://elenipc.home:7077", spark_home = '/home/eleni/spark-2.2.0-bin-hadoop2.7/')
# Print the version of Spark
sv<- sparklyr::spark_version(spark_conn)
print(sv)
# Disconnect from Spark
sparklyr::spark_disconnect(spark_conn)
It is very important for me to be able to execute the hello function from the OpenCPU REST API. (I have used the OpenCPU API to execute many other custom packages.)
When I invoke the OpenCPU API like this:
curl http://localhost/ocpu/user/rstudio/library/myFirstBigDataPackage/R/hello/print -X POST
I get the following response:
Failed while connecting to sparklyr to port (8880) for sessionid (89615): Gateway in port (8880) did not respond.
Path: /home/eleni/spark-2.2.0-bin-hadoop2.7/bin/spark-submit
Parameters: --class, sparklyr.Shell, '/home/rstudio/R/x86_64-pc-linux-gnu-library/3.4/sparklyr/java/sparklyr-2.2-2.11.jar', 8880, 89615
Log: /tmp/ocpu-temp/file26b165c92166_spark.log
---- Output Log ----
Error occurred during initialization of VM
Could not allocate metaspace: 1073741824 bytes
---- Error Log ----
In call:
force(code)
Of course, allocating more memory to both Java and the Spark executor does not resolve the issue. Permission issues are also ruled out, as I have already configured the /etc/apparmor.d/opencpu.d/custom file to give OpenCPU rwx privileges on Spark. It seems to be a connectivity issue that I do not know how to tackle. During method invocation via the OpenCPU API, the Spark logs do not even print anything.
For your information, my environment configuration is as follows:
java version "1.8.0_65"
R version 3.4.1
RStudio version 1.0.153
spark-2.2.0-bin-hadoop2.7
opencpu 1.5 (compatible with my Ubuntu 14.04.3 LTS)
Thank you very much for your support and time!
I need to establish a JDBC connection from my Hadoop Linux environment to a Redshift database. I tried setting the environment variables below in the shell and then initiating Sqoop, but it didn't work.
$ export socks_proxy=hostName
$ export http_proxy=hostName
$ export http_proxyPort=1080
$ export https_proxy=hostName
$ export https_proxyPortName=1080
$ export socksProxyHost=hostName
I also tried setting this up from the spark-shell and running the JDBC connection from there; that didn't work either. Here are my spark-shell invocation and JDBC connection string:
spark-shell --conf "spark.driver.extraJavaOptions=-Dsocks.ProxyHost=HostName" \
  --jars RedshiftJDBC42-1.2.7.1003.jar,spark-redshift_2.10-1.0.0.jar,spark-redshift_2.10-2.0.0.jar

val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:redshift://RedShiftHostName:Port/databaseName?ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory&user=UNAME&password=PASSWORD",
  "dbtable" -> "(select * from information_schema.tables) tmp",
  "driver" -> "com.amazon.redshift.jdbc.Driver"))
Here is the common error I'm getting from sqoop and spark
I'm not sure if I'm doing something wrong in Spark/Sqoop. Is there any other way to run the JDBC connection from a shell script without leveraging the Hadoop ecosystem? Maybe with some Java code run from the shell?
Thanks in advance; I appreciate all your answers with examples.
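Since the question explicitly asks about plain Java from the shell: the JDBC connection can be tested without Sqoop or Spark at all by running a small class with only the Redshift driver jar on the classpath. A rough sketch, reusing the placeholders from the URL above; the JVM-wide SOCKS settings use the standard socksProxyHost/socksProxyPort property names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RedshiftJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Equivalent to passing -DsocksProxyHost=... -DsocksProxyPort=... on the command line
        System.setProperty("socksProxyHost", "HostName");
        System.setProperty("socksProxyPort", "1080");

        // Same JDBC URL as in the spark-shell attempt above (placeholders unchanged)
        String url = "jdbc:redshift://RedShiftHostName:Port/databaseName"
                + "?ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory"
                + "&user=UNAME&password=PASSWORD";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select table_name from information_schema.tables limit 5")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

Compile it and run it with something like java -cp RedshiftJDBC42-1.2.7.1003.jar:. RedshiftJdbcCheck. If that works, the proxy and credentials are fine and the problem lies in the Sqoop/Spark wiring; if it fails, the error from the bare driver is usually much more specific.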
I want to connect to HBase running standalone in a Docker container, using Java and the HBase API.
I use this code to connect:
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "163.172.142.199");
config.set("hbase.zookeeper.property.clientPort","2181");
HBaseAdmin.checkHBaseAvailable(config);
Here is my /etc/hosts file:
127.0.0.1 localhost
XXX.XXX.XXX.XXX hbase-srv
Here is the /etc/hosts file from my Docker container (named hbase-srv):
XXX.XXX.XXX.XXX hbase-srv
With this configuration, I get a connection refused error:
INFO | Initiating client connection, connectString=163.172.142.199:2181 sessionTimeout=90000 watcher=hconnection-0x6aba2b860x0, quorum=163.172.142.199:2181, baseZNode=/hbase
INFO | Opening socket connection to server 163.172.142.199/163.172.142.199:2181. Will not attempt to authenticate using SASL (unknown error)
INFO | Socket connection established to 163.172.142.199/163.172.142.199:2181, initiating session
INFO | Session establishment complete on server 163.172.142.199/163.172.142.199:2181, sessionid = 0x15602f8d8dc0002, negotiated timeout = 40000
INFO | Closing zookeeper sessionid=0x15602f8d8dc0002
INFO | Session: 0x15602f8d8dc0002 closed
INFO | EventThread shut down
org.apache.hadoop.hbase.MasterNotRunningException: com.google.protobuf.ServiceException: java.net.ConnectException: Connection refused
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1560)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1580)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1737)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isMasterRunning(ConnectionManager.java:948)
at org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.java:3159)
at hbase.Benchmark.main(Benchmark.java:26)
However, if I remove the XXX.XXX.XXX.XXX hbase-srv lines from both /etc/hosts files, I get the error unknown host: hbase-srv.
I have also checked that I can successfully telnet to my HBase container on the client port.
On the Docker side, all the ports used by HBase are open and bound to the same number (60000 to 60000, 2181 to 2181, etc.).
I also want to add that everything worked fine when I used this configuration on localhost.
If you can't give me an answer to my problem, could you at least give me a procedure to deploy a standalone HBase in Docker?
UPDATE: Here is my Dockerfile:
FROM java:openjdk-8
ADD hbase-1.2.1 /hbase-1.2.1
WORKDIR /hbase-1.2.1
# ZooKeeper
EXPOSE 2181
# HMaster
EXPOSE 60000
# HMaster Web
EXPOSE 60010
# RegionServer
EXPOSE 60020
# RegionServer Web
EXPOSE 60030
EXPOSE 16010
RUN chmod 755 /hbase-1.2.1/bin/start-hbase.sh
CMD ["/hbase-1.2.1/bin/start-hbase.sh"]
My HBase shell is working. I also tried opening the ports with iptables for both TCP and UDP, but I still have the same problem.
There are two problems with your Dockerfile:
use hbase master start instead of start-hbase.sh
the regionserver is actually not running on 60020
The second problem is not so easy to solve. If you run standalone HBase with version >= 1.2.0 (not entirely sure; I'm running 1.2.0), HBase will use an ephemeral port instead of the default port or the port you provide in hbase-site.xml, which makes it very hard to expose the HBase service from Docker with the unmodified version.
I added a property named hbase.localcluster.port.ephemeral and managed to build a standalone HBase in Docker, which you can reference here.
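A follow-up diagnostic that makes the ephemeral-port issue visible: the client's ZooKeeper session succeeds, so you can read back the address and port the master and regionserver actually registered in ZooKeeper and compare them with what the container publishes. A rough sketch, assuming the ZooKeeper client jar that ships with HBase is on the classpath and the default /hbase znode parent:

import org.apache.zookeeper.ZooKeeper;

public class WhoIsMaster {
    public static void main(String[] args) throws Exception {
        // Same quorum the HBase client uses
        ZooKeeper zk = new ZooKeeper("163.172.142.199:2181", 30000, event -> { });
        try {
            // /hbase/master holds the ServerName the master registered; the payload is
            // protobuf-framed, but the advertised hostname is readable in the raw bytes
            byte[] data = zk.getData("/hbase/master", false, null);
            System.out.println(new String(data, "UTF-8"));

            // Region servers register as children named hostname,port,startcode
            System.out.println(zk.getChildren("/hbase/rs", false));
        } finally {
            zk.close();
        }
    }
}

If the hostname or port shown there is not reachable from outside the container (an internal hostname, or an ephemeral port that Docker does not publish), the client fails exactly as in the stack trace above, no matter what is in /etc/hosts.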