I need to establish a JDBC connection from my Hadoop Linux environment to a Redshift database. I've tried setting the environment variables below in the shell and then initiating Sqoop, but it didn't work.
$export socks_proxy=hostName
$export http_proxy=hostName
$export http_proxyPort=1080
$export https_proxy=hostName
$export https_proxyPortName=1080
$export socksProxyHost=hostName
I also tried setting this up from the Spark shell and running the JDBC connection from there; that didn't work either. Here are my spark-shell invocation and JDBC connection string:
spark-shell --conf "spark.driver.extraJavaOptions=-Dsocks.ProxyHost=HostName" --jars RedshiftJDBC42-1.2.7.1003.jar,spark-redshift_2.10-1.0.0.jar,spark-redshift_2.10-2.0.0.jar
val df = sqlContext.load("jdbc", Map("url" -> "jdbc:redshift://RedShiftHostName:Port/databaseName?ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory&user=UNAME&password=PASSWORD", "dbtable" -> "(select * from information_schema.tables) tmp", "driver" -> "com.amazon.redshift.jdbc.Driver"))
Here is the common error I'm getting from both Sqoop and Spark:
I'm not sure if I'm doing something wrong in Spark/Sqoop. Is there any other way to run the JDBC connection from a shell script without leveraging the Hadoop ecosystem, maybe with some Java code invoked from the shell?
Thanks in advance, appreciate all your answers with examples.
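For context, here is a rough sketch of the kind of standalone Java approach I have in mind (host, port, database, credentials, and proxy values are placeholders, not my real settings):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RedshiftJdbcTest {
    public static void main(String[] args) throws Exception {
        // Route the JDBC socket through the SOCKS proxy on the edge node
        System.setProperty("socksProxyHost", "hostName");
        System.setProperty("socksProxyPort", "1080");

        String url = "jdbc:redshift://RedShiftHostName:Port/databaseName"
                + "?ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory";
        try (Connection conn = DriverManager.getConnection(url, "UNAME", "PASSWORD");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select table_name from information_schema.tables limit 5")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
This would be compiled and run from the shell with the Redshift driver on the classpath, e.g. java -cp .:RedshiftJDBC42-1.2.7.1003.jar RedshiftJdbcTest.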
I have a Testcontainers MySQL container and I need to import a dump file after the container has started. I've tried the two options below.
import javax.annotation.PostConstruct;
import lombok.SneakyThrows;
import org.testcontainers.containers.MySQLContainer;
import org.testcontainers.utility.MountableFile;

public class AbstractTest {

    public static MySQLContainer<?> mySQLContainer = new MySQLContainer<>("mysql:5.7");

    static {
        mySQLContainer
                .withDatabaseName("myDatabase")
                .withCopyFileToContainer(
                        MountableFile.forClasspathResource("init.sql", 0744),
                        "init.sql")
                .withUsername("root")
                .withPassword("root")
                .start();
    }

    @PostConstruct
    @SneakyThrows
    public void init() {
        // option 1:
        // mySQLContainer.execInContainer("mysql -u root -proot myDatabase < init.sql");
        // option 2:
        // mySQLContainer.execInContainer("mysql", "-u", "root", "-proot", "myDatabase", "<", "init.sql");
    }
}
and still no success; it looks like mysql can't parse my command properly, because I get the following as an answer:
mysql Ver 14.14 Distrib 5.7.35, for Linux (x86_64) using EditLine wrapper
Copyright (c) 2000, 2021, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Usage: mysql [OPTIONS] [database]
-?, --help Display this help and exit.
-I, --help Synonym for -?
--auto-rehash Enable automatic rehashing. One doesn't need to use
'rehash' to get table and field completion, but startup
////.....
If I use the following command:
mySQLContainer.execInContainer("mysql", "-u", "root", "-proot");
it works fine, but this is not what I wanted.
The command mysql -u root -proot myDatabase < init.sql works fine if I just connect to the container via bash from the CLI.
So my question: how do I import an SQL dump file into a MySQLContainer in a JUnit Testcontainers test by executing a command in the image?
UPDATE:
I figured out that there is something wrong with the parsing of the "<" sign.
So, basically this works fine from CLI:
docker exec -i mycontainer mysql -uroot -proot myDatabase < init.sql
But this is not working from Java:
mySQLContainer.execInContainer("mysql","-uroot","-proot","myDatabase","<","init.sql");
MySQL can load a dump file automatically if you put it at a special path.
From the docs of the MySQL Docker image:
When a container is started for the first time, a new database with
the specified name will be created and initialized with the provided
configuration variables. Furthermore, it will execute files with
extensions .sh, .sql and .sql.gz that are found in
/docker-entrypoint-initdb.d. Files will be executed in alphabetical
order. You can easily populate your mysql services by mounting a SQL
dump into that directory and provide custom images with contributed
data. SQL files will be imported by default to the database specified
by the MYSQL_DATABASE variable.
So the easiest option is to copy the file there with something like:
.withCopyFileToContainer(MountableFile.forClasspathResource("init.sql"), "/docker-entrypoint-initdb.d/schema.sql")
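Putting this together with the container definition from the question, a minimal sketch could look like the following (database name and credentials are taken from the question; the target file name inside /docker-entrypoint-initdb.d is arbitrary):
import org.testcontainers.containers.MySQLContainer;
import org.testcontainers.utility.MountableFile;

public class AbstractTest {

    public static MySQLContainer<?> mySQLContainer = new MySQLContainer<>("mysql:5.7");

    static {
        mySQLContainer
                .withDatabaseName("myDatabase")
                .withUsername("root")
                .withPassword("root")
                // The MySQL entrypoint runs every .sql file found in
                // /docker-entrypoint-initdb.d the first time the container starts,
                // so no execInContainer call is needed afterwards.
                .withCopyFileToContainer(
                        MountableFile.forClasspathResource("init.sql"),
                        "/docker-entrypoint-initdb.d/schema.sql")
                .start();
    }
}
The script runs before the container is reported as ready, so the data is already in place when the tests start.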
You need an application.properties suitable for MySQL, something like:
spring.jpa.hibernate.ddl-auto=update
spring.datasource.url=jdbc:mysql://${MYSQL_HOST:localhost}:3306/db_example
spring.datasource.username=springuser
spring.datasource.password=ThePassword
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
#spring.jpa.show-sql=true
Then you can add a data.sql file in src/test/resources, which will be run automatically.
Here's what I have successfully done so far on the SCDF Local Server.
I have successfully deployed the SCDF server on my local machine and passed the Kafka and Zookeeper config parameters to it, i.e.:
mymac$ java -jar spring-cloud-dataflow-server-local-1.3.0.RELEASE.jar
--spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.brokers=localhost:9092
--spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.zkNodes=localhost:2181
I was able to create my stream
ingest = producer-app > :broker1
filter = :broker1 > filter-app > :broker2
Now I need help to do the exact same thing on PCFDev
I have my PCFDev instance running.
I have to deploy the SCDF Cloud Foundry jar to PCFDev with my local Kafka and Zookeeper parameters, but when I do the following steps it gives me an error.
1.1) cf push -f manifest-scdf.yml --no-start -p /XXX/XXX/XXX/spring-cloud-dataflow-server-cloudfoundry-1.3.0.BUILD-SNAPSHOT.jar -k 1500M
This runs fine, no problem. But step 1.2:
1.2) cf start dataflow-server --spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.brokers=host.pcfdev.io:9092 --spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.zkNodes=host.pcfdev.io:2181
gives me this error:
Incorrect Usage: unknown flag `spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.brokers'
Below is my manifest-scdf.yml file:
---
instances: 1
memory: 2048M
applications:
  - name: dataflow-server
    host: dataflow-server
    services:
      - redis
      - rabbit
    env:
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_URL: https://api.local.pcfdev.io
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_ORG: pcfdev-org
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_SPACE: pcfdev-space
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_DOMAIN: local.pcfdev.io
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_USERNAME: admin
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_PASSWORD: admin
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_SKIP_SSL_VALIDATION: true
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_STREAM_SERVICES: rabbit
      MAVEN_REMOTE_REPOSITORIES_REPO1_URL: https://repo.spring.io/libs-snapshot
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_DISK: 512
      SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_STREAM_BUILDPACK: java_buildpack
      spring.cloud.deployer.cloudfoundry.stream.memory: 400
      spring.cloud.dataflow.features.tasks-enabled: true
      spring.cloud.dataflow.features.streams-enabled: true
Please help me. Thank you.
There are a few options to supply Kafka credentials to the stream-apps in PCF.
1. Kafka CUPs
This option allows you to create CUPs for an external Kafka service. While deploying the stream, you can then supply the coordinates to each application individually as described in the docs, or you can supply them as global properties for all the stream-apps deployed by the SCDF server.
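For illustration, a user-provided service could be created with the cf CLI along these lines (the service name and credential keys here are examples, not prescribed names):
cf create-user-provided-service kafka-cups -p '{"brokers":"<HOST>:9092","zkNodes":"<HOST>:2181"}'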
2. Inline properties
Instead of extracting from CUPs, you can also directly supply the HOST/PORT while deploying the stream. Again, this can be applied globally, too.
stream deploy myTest --properties "app.*.spring.cloud.stream.kafka.binder.brokers=<HOST>:9092,app.*.spring.cloud.stream.kafka.binder.zkNodes=<HOST>:2181"
Note: The HOST must be reachable by the stream-apps; otherwise, they will continue to try to connect to localhost and likely fail, since the apps are running inside a VM.
The error you're seeing is coming from the CF CLI: it's interpreting the (I'm assuming environment) variables you're providing as flags to the cf start command, and failing.
You could either provide them in your manifest.yml or set their values manually using the CLI's cf set-env command by doing something like this:
cf set-env dataflow-server spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.brokers host.pcfdev.io:9092
cf set-env dataflow-server spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.zkNodes host.pcfdev.io:2181
After you've set them they should be picked up when you run cf start dataflow-server.
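For the manifest.yml alternative mentioned above, a sketch of the equivalent would be to add the two properties under the application's env: block, mirroring the cf set-env values (dotted keys under env: follow the same pattern the question's manifest already uses):
env:
  spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.brokers: host.pcfdev.io:9092
  spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.zkNodes: host.pcfdev.io:2181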
Relevant CLI docs:
http://cli.cloudfoundry.org/en-US/cf/set-env.html
I have created an R package that makes use of sparklyr capabilities within a dummy hello function. My package does a very simple thing: it connects to a Spark cluster, prints the Spark version, and disconnects. The package cleans and builds successfully and executes successfully from R and RStudio.
# Connect to Spark cluster
spark_conn <- sparklyr::spark_connect(master = "spark://elenipc.home:7077", spark_home = '/home/eleni/spark-2.2.0-bin-hadoop2.7/')
# Print the version of Spark
sv<- sparklyr::spark_version(spark_conn)
print(sv)
# Disconnect from Spark
sparklyr::spark_disconnect(spark_conn)
It is very important for me to be able to execute the hello function from the OpenCPU REST API. (I have used the OpenCPU API for executing many other custom-created packages.)
When invoking the OpenCPU API like:
curl http://localhost/ocpu/user/rstudio/library/myFirstBigDataPackage/R/hello/print -X POST
I get the following response:
Failed while connecting to sparklyr to port (8880) for sessionid (89615): Gateway in port (8880) did not respond.
Path: /home/eleni/spark-2.2.0-bin-hadoop2.7/bin/spark-submit
Parameters: --class, sparklyr.Shell, '/home/rstudio/R/x86_64-pc-linux-gnu-library/3.4/sparklyr/java/sparklyr-2.2-2.11.jar', 8880, 89615
Log: /tmp/ocpu-temp/file26b165c92166_spark.log
---- Output Log ----
Error occurred during initialization of VM
Could not allocate metaspace: 1073741824 bytes
---- Error Log ----
In call:
force(code)
Of course, allocating more memory to both the Java and Spark executors does not resolve the issue. Permission issues have also been ruled out, as I already configured the etc/apparmor.d/opencpu.d/custom file to give OpenCPU rwx privileges on Spark. It seems to be a connectivity issue that I do not know how to approach. During method invocation via the OpenCPU API, the Spark logs do not print anything at all.
For your info, my environment configuration is as follows:
java version "1.8.0_65"
R version 3.4.1
RStudio version 1.0.153
spark-2.2.0-bin-hadoop2.7
opencpu 1.5 (compatible with my Ubuntu 14.04.3 LTS)
Thank you very much for your support and time!
I have a client application from which I need to remotely execute queries on Spark using spark-sql. I am able to do it from spark-shell, but how can I execute them remotely from my Scala-based client application?
I have tried the following code:
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "nio")
  .setMaster("spark://master:port")
  .setAppName("Query Fire")
  .set("spark.hadoop.validateOutputSpecs", "true")
  .set("spark.local.dir", "/tmp/spark-temp")
  .set("spark.driver.memory", "4G")
  .set("spark.executor.memory", "4G")
val spark = SparkContext.getOrCreate(conf)
I tried using the default port 7077, but it is not open. I have a Cloudera-based Spark installation, which it seems is not running Spark in standalone mode.
The error I get when running the code against the YARN ResourceManager port 8042 is the following:
16/09/16 20:14:36 WARN TransportChannelHandler: Exception in connection from /192.168.0.171:8042
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
Is there any way to get around this and remotely call spark-sql via a JDBC client, like we can do for Hive queries?
I have set up an H2 cluster but cannot connect via the console or using a datasource. All I get is this:
IO Exception: "java.io.IOException: The filename, directory name, or volume label syntax is incorrect"; "E:/baseDirDefinedInServerConnection/myDB,localhost:1112/myDB" [90031-176] 90031/90031 (Help)
I have configured 2 servers thus:
java -cp h2-1.3.167.jar org.h2.tools.Server -tcp -tcpPort 1111 -tcpAllowOthers -baseDir E:\myBaseDir
at tcp://myIp:1111 (others can connect)
java -cp h2-1.3.167.jar org.h2.tools.Server -tcp -tcpPort 1112 -tcpAllowOthers -baseDir E:\myBaseDir\server
at tcp://myIp:1112 (others can connect)
So you see I have one database in a directory (this has been created) and another database in another directory. Both are up and running.
I have run the cluster tool thus:
java -cp h2-1.3.167.jar org.h2.tools.CreateCluster -urlSource jdbc:h2:tcp://localhost:1111/myDB -urlTarget jdbc:h2:tcp://localhost:1112/myDB -user username -password pass -serverList localhost:1111,localhost:1112
And it all looks good. If I try to connect through the console without the cluster list, I get this message, which proves we are in clustered mode, which is good:
Clustering error - database currently runs in cluster mode, server list: 'localhost:1111,localhost:1112'" [
I have checked the permissions on the directories and all have read/write access.
Yes this is a windows machine.
Using H2 version:
Bundle-Vendor: H2 Group
Bundle-Version: 1.3.167
Any ideas what I might have done wrong?
Thanks for reading.
I guess you already found out that one should connect like this:
jdbc:h2:tcp://localhost:1111,localhost:1112/myDB
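For completeness, a minimal Java sketch of connecting with the cluster URL (the username and password are the ones passed to CreateCluster above; run with the h2 jar on the classpath):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2ClusterConnect {
    public static void main(String[] args) throws Exception {
        // The URL must list the full server list; H2 rejects URLs naming only
        // one node with the "database currently runs in cluster mode" error.
        String url = "jdbc:h2:tcp://localhost:1111,localhost:1112/myDB";
        try (Connection conn = DriverManager.getConnection(url, "username", "pass");
             Statement stmt = conn.createStatement();
             // Per the H2 clustering docs, this shows the active cluster nodes
             ResultSet rs = stmt.executeQuery(
                     "SELECT VALUE FROM INFORMATION_SCHEMA.SETTINGS WHERE NAME = 'CLUSTER'")) {
            while (rs.next()) {
                System.out.println("Cluster server list: " + rs.getString(1));
            }
        }
    }
}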